This is the online appendix of the paper Rank-normalization, folding, and localization: An improved \(\widehat{R}\) for assessing convergence of MCMC, available on GitHub (https://github.com/avehtari/rhat_ess). Here, we provide all the code related to the examples presented in the paper, as well as additional numerical experiments not discussed in the paper itself.
To help you find your way through all the examples presented in this online appendix, below is a list of links to the examples discussed in the paper:
In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. Appendices B-E contain more detailed analysis of different algorithm variants and further examples.
First, we load all the necessary R packages and additional functions.
library(tidyverse)
library(gridExtra)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(rjags)
library(abind)
source('monitornew.R')
source('monitorplot.R')
This section relates to the examples presented in Section 5.1 of the paper.
Traditional \(\widehat{R}\) is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is infinite, this approach is not well justified, as we demonstrate here with a Cauchy-distributed example. The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution. Appendix C contains more detailed analysis of different algorithm variants and further Cauchy examples.
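As a quick illustration of the underlying issue (a side simulation, not part of the case study), the running mean of independent Cauchy draws never stabilizes, because the mean of the Cauchy distribution does not exist:
set.seed(1)
draws <- rcauchy(1e5)
# The running mean keeps jumping no matter how many draws we add
plot(cumsum(draws) / seq_along(draws), type = "l",
     xlab = "number of draws", ylab = "running mean")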
The nominal Cauchy model with direct parameterization is as follows:
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
We get HMC-specific diagnostic warnings (Betancourt, 2017) about a very large number of post-warmup transitions that exceeded the maximum treedepth, as well as a low estimated Bayesian fraction of missing information, indicating slow mixing, likely due to the very long tails of the Cauchy distribution.
mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -5.87 0.02 6.78 0.92 13.64 1.01 1972 356
x[2] -5.76 -0.03 5.11 -0.27 7.53 1.00 2411 797
x[3] -6.12 0.00 7.73 0.77 10.28 1.01 440 112
x[4] -5.30 -0.01 5.41 -0.06 7.19 1.01 2263 691
x[5] -13.14 -0.04 5.83 -2.78 17.58 1.03 167 44
x[6] -8.10 -0.06 6.39 -0.92 15.59 1.01 1603 398
x[7] -7.25 -0.08 7.55 -0.29 9.68 1.01 1220 457
x[8] -4.79 0.02 6.23 0.57 8.65 1.02 1559 428
x[9] -4.69 0.01 4.82 0.13 5.43 1.00 2603 778
x[10] -8.43 -0.04 6.50 -1.92 32.96 1.01 1599 257
x[11] -6.21 0.02 7.09 0.43 16.21 1.01 2458 469
x[12] -5.55 0.02 6.62 0.17 6.37 1.00 2830 398
x[13] -6.16 0.00 6.13 0.44 12.78 1.01 2277 561
x[14] -5.45 0.02 6.79 0.29 8.44 1.01 3118 642
x[15] -8.17 0.05 9.55 0.26 30.61 1.01 2121 517
x[16] -5.83 0.00 6.23 0.31 8.60 1.01 2171 583
x[17] -18.10 -0.07 7.39 -6.89 47.38 1.03 207 65
x[18] -5.89 0.05 5.90 0.10 5.42 1.01 3409 660
x[19] -7.04 0.03 6.23 -0.23 12.26 1.01 1422 416
x[20] -6.58 0.03 7.53 -0.19 7.81 1.02 947 441
x[21] -8.11 0.01 6.57 -2.16 28.47 1.00 1785 391
x[22] -5.47 0.04 5.24 0.25 8.30 1.00 2836 706
x[23] -10.04 -0.05 6.55 -11.16 87.97 1.04 314 68
x[24] -13.18 -0.07 5.72 -4.58 30.89 1.01 431 95
x[25] -6.89 -0.02 5.60 -0.44 8.76 1.02 1888 500
x[26] -5.33 -0.02 4.69 -0.30 6.55 1.01 3095 727
x[27] -5.19 -0.04 5.17 -0.26 8.50 1.00 3113 868
x[28] -6.81 -0.02 6.81 0.15 7.71 1.00 4186 919
x[29] -9.96 -0.02 6.99 -1.66 18.87 1.01 926 219
x[30] -6.20 0.02 6.62 0.44 14.44 1.01 3500 552
x[31] -18.84 -0.04 5.77 -4.22 29.48 1.02 208 48
x[32] -6.33 -0.01 6.21 -0.02 6.73 1.02 3044 743
x[33] -4.81 0.03 4.93 0.22 7.89 1.01 3504 1003
x[34] -5.75 0.02 5.75 -0.12 9.23 1.00 1889 656
x[35] -5.98 0.03 7.16 0.06 8.68 1.01 2701 634
x[36] -7.37 0.00 9.19 0.08 10.82 1.01 1404 394
x[37] -5.03 0.02 7.82 2.52 29.21 1.01 2020 353
x[38] -5.40 -0.01 5.59 -0.06 6.64 1.01 3644 738
x[39] -4.78 -0.02 5.36 0.27 6.40 1.00 2849 752
x[40] -6.16 0.00 4.58 -0.91 10.45 1.01 1843 452
x[41] -5.46 0.07 6.55 0.05 6.06 1.01 2322 616
x[42] -7.63 -0.05 6.38 -0.75 10.10 1.01 1786 510
x[43] -7.47 -0.04 6.39 -0.59 8.32 1.01 1871 555
x[44] -5.22 0.00 5.17 0.11 6.15 1.01 1613 569
x[45] -4.44 0.04 13.98 3.19 17.67 1.01 295 70
x[46] -5.04 0.01 5.60 0.22 9.06 1.02 3151 432
x[47] -5.87 0.03 5.46 0.19 11.73 1.02 2247 495
x[48] -4.99 -0.03 4.91 0.05 10.22 1.01 2914 718
x[49] -6.42 0.01 10.78 10.69 83.66 1.02 223 51
x[50] -6.24 0.02 5.71 0.41 19.28 1.00 2610 783
I 0.00 1.00 1.00 0.51 0.50 1.00 683 4000
lp__ -88.54 -68.47 -50.77 -69.10 11.75 1.01 249 449
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Tail_ESS'])
Several values of Rhat greater than 1.01 and some ESS values less than 400 also indicate convergence problems. Appendix C contains more results with longer chains.
We can further analyze potential problems using local efficiency and rank plots. We specifically investigate x[5], which, in this specific run, had the smallest tail-ESS of 44.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency of small interval probability estimates (see Section Efficiency estimate for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures the mixing of an indicator function that indicates whether the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.
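To make the construction concrete, the following minimal sketch computes the indicator for a single small interval and its ESS with the monitor_extra() helper sourced above (plot_local_ess() below does this more carefully for all \(k\) intervals; its exact internals may differ):
x5 <- as.array(fit_nom)[, , "x[5]"]                          # iterations x chains
breaks <- quantile(x5, probs = seq(0, 1, length.out = 21))   # k = 20 intervals
# Indicator of a draw falling into the left-most 5% interval
ind <- 1 * (x5 >= breaks[1] & x5 <= breaks[2])
# ESS of the indicator function gives the local efficiency in that interval
as.data.frame(monitor_extra(ind))$zsseff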
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
The efficiency of sampling is low in the tails, which is clearly caused by slow mixing in long tails of the Cauchy distribution.
Orange ticks show iterations that exceeded the maximum treedepth.
An alternative way to examine the efficiency in different parts of the posterior is to compute efficiency estimates for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures the mixing of an indicator function that indicates whether the values are smaller than or equal to the corresponding quantile.
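Analogously, a minimal sketch of a single quantile indicator and its ESS (plot_quantile_ess() below does this for a whole grid of quantiles):
q05 <- quantile(x5, probs = 0.05)        # 5% quantile of the pooled draws
ind_q <- 1 * (x5 <= q05)                 # indicator of being at or below the quantile
as.data.frame(monitor_extra(ind_q))$zsseff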
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
Similar as above, we see that the efficiency of sampling is low in the tails. Orange ticks show iterations that exceeded the maximum treedepth.
We may also investigate how the estimated effective sample sizes change when we use more and more draws; Brooks and Gelman (1998) proposed a similar graph for \(\widehat{R}\). If the effective sample size is highly unstable, does not increase proportionally with more draws, or even decreases, this indicates that simply running longer chains will likely not solve the convergence issues. In the plot below, we see how unstable both bulk-ESS and tail-ESS are for this example.
plot_change_ess(fit = fit_nom, par = which_min_ess)
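For illustration, a minimal sketch of the computation behind such a plot, reusing monitor_extra() on growing prefixes of the chains (plot_change_ess() may differ in its details):
ns <- seq(100, nrow(x5), by = 100)
ess_path <- sapply(ns, function(n)
  as.data.frame(monitor_extra(x5[1:n, , drop = FALSE]))$zsseff)
plot(ns, ess_path, type = "b",
     xlab = "draws per chain", ylab = "Bulk-ESS")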
We can further analyze potential problems using rank plots, which clearly show the mixing problem between chains. In the case of good mixing, all rank plots should be close to uniform.
samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Next, we examine an alternative parameterization of the Cauchy as a scale mixture of Gaussians. The model has two parameters, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that diagnostics should be focused on the actual quantities of interest (here \(x\)), not only on the sampled parameters.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
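As a quick Monte Carlo check of this representation (a side calculation, not from the original case study): since Gamma(0.5, rate 0.5) is a \(\chi^2_1\) distribution, \(x_a/\sqrt{x_b}\) is Student-t with one degree of freedom, i.e., standard Cauchy.
set.seed(1)
x_check <- rnorm(1e5) / sqrt(rgamma(1e5, shape = 0.5, rate = 0.5))
# Empirical quantiles should match the standard Cauchy quantiles
rbind(empirical = quantile(x_check, c(0.25, 0.5, 0.75)),
      theoretical = qcauchy(c(0.25, 0.5, 0.75)))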
Run the alternative model:
fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster.
mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x_a[1] -1.68 -0.03 1.60 -0.02 1.00 1 4260 3065
x_a[2] -1.59 -0.01 1.71 0.02 1.00 1 3650 3147
x_a[3] -1.62 0.00 1.61 0.01 1.00 1 4090 2887
x_a[4] -1.66 -0.02 1.58 -0.02 0.99 1 4145 3083
x_a[5] -1.69 -0.01 1.72 -0.01 1.02 1 4573 2513
x_a[6] -1.67 0.01 1.71 0.00 1.02 1 4280 2992
x_a[7] -1.69 0.03 1.71 0.02 1.01 1 4071 2907
x_a[8] -1.61 0.01 1.64 0.01 0.99 1 3998 2895
x_a[9] -1.71 0.02 1.64 0.01 1.00 1 3866 2987
x_a[10] -1.63 -0.01 1.60 -0.01 1.00 1 3981 2747
x_a[11] -1.60 -0.04 1.61 -0.01 0.99 1 4217 3576
x_a[12] -1.64 0.01 1.64 0.00 1.00 1 3669 2686
x_a[13] -1.70 -0.01 1.68 -0.01 1.02 1 3886 2905
x_a[14] -1.69 0.03 1.66 0.02 1.00 1 3963 2726
x_a[15] -1.63 -0.01 1.67 0.00 1.00 1 3741 3186
x_a[16] -1.65 -0.01 1.65 0.00 1.00 1 4344 3249
x_a[17] -1.66 -0.02 1.67 0.00 1.01 1 4345 2878
x_a[18] -1.56 0.01 1.55 0.00 0.95 1 3917 2641
x_a[19] -1.61 -0.02 1.65 -0.02 0.99 1 4046 2906
x_a[20] -1.64 0.01 1.72 0.01 1.01 1 4215 2879
x_a[21] -1.62 -0.01 1.58 -0.01 0.98 1 4314 2553
x_a[22] -1.61 0.00 1.61 -0.01 0.98 1 3891 3010
x_a[23] -1.69 -0.02 1.62 -0.03 1.00 1 3873 3066
x_a[24] -1.61 -0.02 1.61 -0.01 0.98 1 3485 3012
x_a[25] -1.65 0.01 1.64 0.00 1.01 1 4311 2981
x_a[26] -1.60 0.05 1.64 0.03 0.98 1 3960 3114
x_a[27] -1.64 -0.02 1.66 -0.02 1.01 1 3584 2797
x_a[28] -1.62 0.01 1.59 0.01 0.99 1 3899 2889
x_a[29] -1.60 0.02 1.58 0.02 0.97 1 4387 3171
x_a[30] -1.65 0.02 1.70 0.02 1.01 1 4316 3110
x_a[31] -1.64 0.00 1.64 0.00 0.98 1 3863 2659
x_a[32] -1.67 -0.01 1.65 0.00 1.00 1 4433 2690
x_a[33] -1.65 -0.02 1.59 -0.01 0.99 1 4236 2777
x_a[34] -1.67 0.02 1.69 0.01 1.00 1 4170 3164
x_a[35] -1.62 0.00 1.58 -0.01 0.98 1 3751 2628
x_a[36] -1.68 0.01 1.65 0.00 0.99 1 3896 2916
x_a[37] -1.60 0.03 1.66 0.02 0.99 1 4487 2494
x_a[38] -1.69 0.01 1.72 0.01 1.03 1 4030 2707
x_a[39] -1.64 -0.01 1.66 -0.01 1.01 1 4407 2787
x_a[40] -1.69 -0.02 1.60 -0.03 1.01 1 4144 2941
x_a[41] -1.64 0.02 1.65 0.01 1.00 1 4252 3139
x_a[42] -1.65 -0.02 1.61 -0.03 0.99 1 3663 3193
x_a[43] -1.62 -0.02 1.64 -0.01 1.00 1 3836 2287
x_a[44] -1.65 0.02 1.76 0.03 1.03 1 3758 3033
x_a[45] -1.69 0.04 1.71 0.02 1.03 1 3855 2971
x_a[46] -1.67 -0.01 1.70 0.01 1.02 1 4204 3109
x_a[47] -1.62 -0.03 1.65 -0.01 1.01 1 3967 3009
x_a[48] -1.64 0.01 1.67 0.01 1.01 1 4374 3361
x_a[49] -1.64 0.02 1.64 0.01 0.99 1 3825 3136
x_a[50] -1.65 -0.01 1.63 -0.01 1.01 1 3746 2687
x_b[1] 0.00 0.41 3.83 1.00 1.43 1 2374 1405
x_b[2] 0.00 0.47 3.73 1.00 1.39 1 2243 1420
x_b[3] 0.00 0.47 3.76 1.00 1.38 1 2860 1857
x_b[4] 0.01 0.49 3.95 1.02 1.37 1 2753 1598
x_b[5] 0.00 0.49 3.89 1.04 1.48 1 3378 1876
x_b[6] 0.00 0.46 3.81 1.00 1.38 1 2619 1389
x_b[7] 0.01 0.47 4.07 1.02 1.44 1 2831 1442
x_b[8] 0.01 0.45 3.75 1.01 1.42 1 2560 1180
x_b[9] 0.00 0.44 3.83 1.00 1.42 1 2785 1265
x_b[10] 0.00 0.48 3.78 1.01 1.40 1 2691 1388
x_b[11] 0.00 0.45 3.87 0.99 1.36 1 2809 1682
x_b[12] 0.00 0.42 3.94 1.00 1.44 1 2549 1379
x_b[13] 0.00 0.45 3.94 1.02 1.44 1 2504 1292
x_b[14] 0.00 0.44 3.71 0.97 1.39 1 2963 1635
x_b[15] 0.00 0.43 3.98 1.02 1.45 1 2545 1372
x_b[16] 0.01 0.49 3.62 1.00 1.38 1 2419 1408
x_b[17] 0.00 0.47 3.86 1.01 1.39 1 2424 1455
x_b[18] 0.00 0.46 3.79 0.98 1.38 1 2174 1352
x_b[19] 0.00 0.44 3.81 0.99 1.39 1 2982 1728
x_b[20] 0.00 0.48 3.61 0.97 1.33 1 3157 1683
x_b[21] 0.00 0.47 3.74 0.98 1.35 1 2889 1497
x_b[22] 0.00 0.45 3.85 1.02 1.48 1 2439 1467
x_b[23] 0.00 0.46 4.02 1.02 1.44 1 2825 1920
x_b[24] 0.00 0.46 3.79 1.00 1.45 1 2549 1438
x_b[25] 0.00 0.49 3.84 1.01 1.41 1 2510 1291
x_b[26] 0.00 0.44 4.08 1.03 1.51 1 3282 1670
x_b[27] 0.00 0.43 3.75 0.99 1.38 1 2441 1435
x_b[28] 0.00 0.45 3.77 1.00 1.43 1 2631 1289
x_b[29] 0.00 0.42 3.80 0.97 1.40 1 2566 1489
x_b[30] 0.00 0.45 3.82 0.99 1.42 1 2520 1629
x_b[31] 0.00 0.43 3.84 0.99 1.37 1 2811 1813
x_b[32] 0.00 0.45 3.75 1.00 1.41 1 2410 1348
x_b[33] 0.01 0.46 3.81 0.99 1.38 1 2323 1406
x_b[34] 0.00 0.46 3.89 1.01 1.49 1 2860 1488
x_b[35] 0.00 0.41 3.71 0.97 1.42 1 2534 1576
x_b[36] 0.00 0.44 3.60 0.97 1.33 1 2480 1798
x_b[37] 0.00 0.47 3.82 1.01 1.42 1 2791 1639
x_b[38] 0.00 0.47 4.12 1.04 1.49 1 2643 1489
x_b[39] 0.00 0.45 3.96 1.02 1.44 1 2592 1558
x_b[40] 0.00 0.45 3.68 0.99 1.40 1 2457 1478
x_b[41] 0.00 0.48 3.76 1.00 1.38 1 2768 1634
x_b[42] 0.00 0.46 3.95 1.01 1.43 1 2261 1179
x_b[43] 0.00 0.48 3.85 1.02 1.42 1 2516 1259
x_b[44] 0.00 0.46 3.50 0.97 1.32 1 2702 1473
x_b[45] 0.00 0.45 4.01 1.04 1.49 1 2873 1646
x_b[46] 0.00 0.45 3.94 1.00 1.42 1 2482 1654
x_b[47] 0.00 0.49 3.75 1.01 1.40 1 2541 1295
x_b[48] 0.00 0.46 3.79 0.99 1.42 1 2345 1269
x_b[49] 0.00 0.43 3.85 1.00 1.40 1 2497 1443
x_b[50] 0.00 0.46 3.78 1.01 1.44 1 2774 1414
x[1] -7.64 -0.04 7.33 -0.69 32.90 1 3894 2154
x[2] -6.02 -0.01 6.41 4.16 362.61 1 3522 1728
x[3] -6.54 0.00 6.33 0.19 33.51 1 3496 2544
x[4] -5.52 -0.02 5.74 -3.14 191.65 1 3774 2322
x[5] -6.20 -0.01 6.29 -0.47 21.58 1 4132 2627
x[6] -5.55 0.01 6.40 0.19 91.87 1 4172 2472
x[7] -5.73 0.05 6.63 -0.03 14.48 1 3751 2163
x[8] -5.80 0.01 5.58 -1.23 53.41 1 3654 2398
x[9] -5.96 0.03 6.01 -6.47 228.29 1 3568 2182
x[10] -5.73 -0.01 6.11 -1.49 88.67 1 3612 1967
x[11] -6.35 -0.05 6.08 0.98 48.99 1 3932 2516
x[12] -6.59 0.02 5.94 -0.79 30.31 1 3390 2201
x[13] -6.36 -0.01 6.24 -6.55 221.32 1 3898 2277
x[14] -6.64 0.04 6.51 3.57 149.72 1 3706 2545
x[15] -7.28 -0.01 5.90 -0.32 18.93 1 3695 2032
x[16] -5.65 -0.01 5.28 0.37 32.30 1 3940 2340
x[17] -6.10 -0.03 6.17 -0.33 26.14 1 3999 2155
x[18] -5.74 0.02 6.79 3.13 94.50 1 3866 2080
x[19] -6.01 -0.03 6.04 0.35 18.39 1 3802 2327
x[20] -5.56 0.01 6.22 -0.07 30.58 1 3875 2336
x[21] -6.21 -0.01 6.22 0.33 35.50 1 3907 2386
x[22] -6.23 0.00 6.02 -8.18 356.87 1 3701 1765
x[23] -6.17 -0.02 6.09 -0.37 14.57 1 3893 2626
x[24] -6.25 -0.02 5.56 -0.61 81.96 1 3133 2014
x[25] -6.72 0.02 5.47 1.85 202.62 1 3898 2316
x[26] -5.61 0.06 5.81 -0.34 34.70 1 3876 2844
x[27] -7.91 -0.03 6.89 6.55 323.96 1 3694 2563
x[28] -6.29 0.01 6.80 0.35 31.31 1 3827 2201
x[29] -6.57 0.02 6.30 -0.41 23.66 1 3796 2357
x[30] -5.92 0.03 6.31 -0.31 38.53 1 3879 2157
x[31] -5.80 0.00 6.28 0.36 35.66 1 3728 2578
x[32] -6.10 -0.02 6.39 -3.81 298.91 1 4410 2443
x[33] -6.00 -0.03 5.39 1.22 81.47 1 3641 2290
x[34] -5.87 0.04 6.56 0.09 30.46 1 3854 2567
x[35] -6.81 0.00 6.06 -0.07 27.08 1 3462 2256
x[36] -6.12 0.01 6.09 -0.81 40.91 1 3740 2638
x[37] -5.92 0.03 6.50 -6.16 263.64 1 4235 2368
x[38] -6.46 0.01 6.00 0.60 58.15 1 3707 2080
x[39] -6.79 -0.01 6.27 -1.86 64.47 1 4049 2316
x[40] -6.97 -0.02 5.62 0.25 146.44 1 3775 2195
x[41] -6.12 0.02 5.98 -0.88 67.65 1 3947 2551
x[42] -7.48 -0.03 6.42 -18.09 643.89 1 3643 2149
x[43] -5.99 -0.02 6.57 -0.40 52.47 1 3563 2327
x[44] -6.09 0.03 5.93 0.44 26.31 1 3772 2407
x[45] -6.25 0.04 6.65 -0.18 15.75 1 3756 2345
x[46] -5.88 -0.02 7.59 0.75 28.36 1 3575 2169
x[47] -6.18 -0.03 5.82 0.24 79.45 1 3777 2190
x[48] -6.61 0.01 6.03 -3.56 129.17 1 3840 2089
x[49] -5.88 0.02 6.52 -0.30 46.02 1 3834 2548
x[50] -7.08 0.00 6.32 -0.71 45.99 1 3522 2381
I 0.00 0.00 1.00 0.49 0.50 1 2926 4000
lp__ -95.29 -81.24 -69.52 -81.62 7.89 1 1517 2694
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[101:150, 'Tail_ESS'])
For all parameters, Rhat is less than 1.01 and ESS exceeds 400, indicating that sampling worked much better with this alternative parameterization. Appendix C has more results using other parameterizations of the Cauchy distribution. The vectors x_a and x_b used to form the Cauchy-distributed x have stable quantile, mean, and sd estimates. The quantiles of each x_j are stable too, but their mean and variance estimates vary widely.
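The instability of the moment estimates can be seen directly from the draws; for example (a quick side check), the per-chain medians of x[2] agree closely while the per-chain means vary widely:
x2 <- as.array(fit_alt1)[, , "x[2]"]
apply(x2, 2, median)  # stable across chains
apply(x2, 2, mean)    # varies widely across chains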
We can further analyze potential problems using local efficiency estimates and rank plots. For this example, we take a detailed look at x[2], which had the smallest tail-ESS of 1728.
We examine the sampling efficiency in different parts of the posterior by computing the efficiency estimates for small interval probability estimates.
plot_local_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 20)
The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = which_min_ess + 100, nalpha = 40)
The rank plots also look close to uniform across chains, which is consistent with good mixing.
samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
In summary, the alternative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail-ESS.
Half-Cauchy priors for non-negative parameters are common and, at least in Stan, are usually specified via the nominal parameterization. In this example, we set independent half-Cauchy distributions on each element of the 50-dimensional vector \(x\) constrained to be positive (in Stan, <lower=0>). Stan then automatically samples in the unconstrained log(x) space, which crucially changes the geometry.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the half-Cauchy with nominal parameterization and positive constraint:
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the Cauchy nominal model without the constraint.
mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] 0.08 1.01 14.11 7.41 128.72 1.00 6705 1811
x[2] 0.08 0.99 13.81 9.58 159.58 1.00 6019 1751
x[3] 0.07 0.97 13.53 4.72 30.32 1.00 7470 2285
x[4] 0.09 0.99 11.20 4.99 43.10 1.00 8348 2125
x[5] 0.07 1.01 14.51 5.05 47.10 1.00 7567 2310
x[6] 0.08 1.02 13.91 91.28 5383.88 1.00 7493 2065
x[7] 0.08 0.99 13.46 6.57 93.25 1.00 7712 2009
x[8] 0.09 1.00 11.29 4.84 34.92 1.00 9379 2091
x[9] 0.07 0.99 13.40 10.80 162.43 1.00 7531 1955
x[10] 0.08 1.02 12.53 5.61 49.28 1.00 7457 2252
x[11] 0.09 1.00 10.67 6.78 116.46 1.01 7357 2427
x[12] 0.08 1.01 12.34 5.46 47.09 1.00 8039 2034
x[13] 0.08 1.01 11.84 4.77 62.16 1.00 7175 2071
x[14] 0.08 1.01 11.97 4.99 65.23 1.00 8532 2437
x[15] 0.08 0.98 13.84 6.31 61.42 1.00 8956 2054
x[16] 0.08 0.98 11.59 6.25 67.00 1.00 7819 2342
x[17] 0.08 0.97 11.95 3.92 19.12 1.00 6986 2272
x[18] 0.08 1.00 10.98 4.05 23.61 1.00 6999 2293
x[19] 0.08 1.03 12.08 38.38 2005.23 1.00 7613 2391
x[20] 0.07 1.00 14.24 5.68 39.23 1.00 8368 2055
x[21] 0.08 1.01 11.88 6.11 110.22 1.00 8162 2794
x[22] 0.09 1.02 11.18 7.01 169.77 1.00 7670 2004
x[23] 0.07 1.00 14.22 7.35 94.96 1.00 7419 2097
x[24] 0.07 0.99 12.30 6.63 130.07 1.00 7346 2007
x[25] 0.09 1.01 11.29 9.22 343.88 1.00 8103 2300
x[26] 0.07 1.05 13.90 7.72 130.68 1.00 7079 2415
x[27] 0.07 0.99 14.12 5.75 52.77 1.00 8381 1978
x[28] 0.07 1.00 15.56 7.49 73.53 1.00 8576 2072
x[29] 0.08 1.00 13.28 5.13 37.83 1.00 7709 2026
x[30] 0.06 0.99 14.14 7.69 139.13 1.00 6518 2047
x[31] 0.08 1.04 16.07 6.73 50.89 1.00 7130 2138
x[32] 0.09 1.01 13.23 5.32 43.21 1.00 7794 2524
x[33] 0.08 1.00 12.96 6.26 92.50 1.00 7319 2174
x[34] 0.09 0.97 11.01 3.78 23.01 1.00 7419 2390
x[35] 0.07 1.00 13.56 5.08 30.58 1.00 6842 2454
x[36] 0.07 1.01 13.18 6.52 80.50 1.00 7526 1794
x[37] 0.09 1.03 11.61 4.16 23.07 1.00 7229 2357
x[38] 0.08 0.99 15.73 5.35 58.79 1.00 8059 1870
x[39] 0.07 1.00 13.93 31.05 827.68 1.00 8216 1702
x[40] 0.07 1.03 12.88 10.80 173.36 1.00 7306 1764
x[41] 0.07 0.97 13.42 7.83 96.96 1.00 7207 1956
x[42] 0.09 0.99 11.94 8.61 152.58 1.00 7528 2437
x[43] 0.09 1.01 12.12 4.59 52.98 1.00 7381 2450
x[44] 0.08 1.02 12.66 4.62 34.61 1.00 6487 2006
x[45] 0.07 1.00 13.25 20.04 914.15 1.00 6961 2096
x[46] 0.07 1.03 14.46 4.87 27.22 1.00 7899 2126
x[47] 0.08 1.00 13.13 6.86 99.30 1.00 6569 2723
x[48] 0.07 1.02 12.49 6.80 87.98 1.00 7776 2610
x[49] 0.08 1.00 12.68 5.54 37.20 1.00 7916 2144
x[50] 0.10 0.99 11.95 8.59 155.18 1.00 8105 1710
I 0.00 0.00 1.00 0.50 0.50 1.00 5825 4000
lp__ -80.63 -69.45 -59.66 -69.65 6.45 1.00 1141 2003
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All values of Rhat are less than 1.01 and ESS exceeds 400 for all parameters, indicating good performance of the sampler despite using the nominal parameterization of the Cauchy distribution. More experiments with the half-Cauchy distribution can be found in Appendix C.
This section relates to the examples presented in Section 5.2 of the paper.
The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013), which even in its simplicity illustrates the typical problems in inference for hierarchical models. The Stan models are adapted from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences. Appendix D contains more detailed analysis of different algorithm variants.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.
eight_schools <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 28 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
Still, we observe a lot of divergent transitions, which is in itself already a sufficient indicator of convergence problems. We can use Rhat and ESS diagnostics to recognize problematic parts of the posterior. These have the advantage over the divergent transitions diagnostic that they can be used with all MCMC algorithms, not only with HMC.
mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.93 4.53 9.85 4.50 3.26 1.01 695 1124
tau 0.71 3.02 9.79 3.86 3.12 1.02 218 229
theta[1] -1.24 5.88 16.01 6.41 5.52 1.01 991 1590
theta[2] -2.44 4.99 12.84 5.05 4.73 1.00 1247 2033
theta[3] -4.99 4.32 12.13 4.06 5.36 1.00 1168 1787
theta[4] -2.91 4.88 12.73 4.89 4.77 1.00 1024 1905
theta[5] -3.99 3.95 10.80 3.75 4.59 1.00 951 2013
theta[6] -3.97 4.38 11.45 4.16 4.71 1.01 1196 1701
theta[7] -0.91 6.04 15.06 6.44 4.99 1.00 936 1583
theta[8] -3.08 4.86 13.78 4.99 5.20 1.00 1287 1492
lp__ -24.50 -15.31 -4.61 -15.03 6.07 1.02 210 240
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
See Appendix D for results from longer chains.
Bulk-ESS and Tail-ESS for the between-school standard deviation tau are 218 and 229, respectively. Both are much less than 400, indicating that we should investigate this parameter more carefully. We thus examine the sampling efficiency in different parts of the posterior by computing the efficiency estimate for small interval probability estimates for tau. These plots may show either quantiles or parameter values on the vertical axis. Red ticks show divergent transitions.
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20)
plot_local_ess(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)
We see that the sampler has difficulties in exploring small tau values. As the sampling efficiency for estimating small tau values is practically zero, we may be missing a substantial amount of posterior mass and thus obtain biased estimates. Red ticks, which show iterations with divergences, are concentrated at small tau values, which gives us another indication of problems in exploring small values.
We also examine the sampling efficiency of different quantile estimates. Again, these plots may show either quantiles or parameter values on the vertical axis.
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40)
plot_quantile_ess(fit = fit_cp, par = "tau", nalpha = 40, rank = FALSE)
Most of the quantile estimates have worryingly low effective sample size.
Next, we examine how the estimated effective sample size changes when we use more and more draws. Here we do not see sudden changes, but both bulk-ESS and tail-ESS are consistently low. See Appendix D for results from longer chains.
plot_change_ess(fit = fit_cp, par = "tau")
In line with the other findings, rank plots of tau clearly show problems in the mixing of the chains. In particular, the rank plot for the first chain indicates that it was unable to explore the lower end of the posterior range, while the spike in the rank plot for chain 2 indicates that it spent too much time stuck at these values.
samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])
For hierarchical models, the non-centered parameterization often works better than the centered one:
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:
fit_ncp2 <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
We get no divergences and no other warnings, which is a good first sign.
mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.00 4.41 9.80 4.43 3.29 1 5201 3113
tau 0.24 2.73 9.46 3.56 3.17 1 2669 1967
theta_tilde[1] -1.35 0.30 1.96 0.31 1.00 1 5593 2531
theta_tilde[2] -1.45 0.11 1.63 0.10 0.92 1 5607 2953
theta_tilde[3] -1.70 -0.09 1.60 -0.08 1.00 1 5602 2858
theta_tilde[4] -1.44 0.09 1.63 0.08 0.93 1 5810 3245
theta_tilde[5] -1.66 -0.17 1.39 -0.16 0.92 1 5022 3163
theta_tilde[6] -1.60 -0.07 1.54 -0.06 0.96 1 5057 2917
theta_tilde[7] -1.31 0.34 1.88 0.33 0.98 1 4562 2750
theta_tilde[8] -1.50 0.07 1.66 0.08 0.96 1 5754 2988
theta[1] -1.78 5.66 16.24 6.19 5.70 1 4396 3159
theta[2] -2.33 4.83 12.35 4.89 4.54 1 5315 3416
theta[3] -5.07 4.17 11.92 3.99 5.33 1 4351 3223
theta[4] -2.77 4.76 12.52 4.85 4.76 1 5822 3155
theta[5] -4.41 3.85 10.59 3.64 4.61 1 4822 3414
theta[6] -3.82 4.28 11.45 4.07 4.80 1 4864 2818
theta[7] -1.18 5.86 15.20 6.26 5.01 1 4804 3569
theta[8] -3.36 4.78 12.93 4.89 5.15 1 5097 3333
lp__ -11.20 -6.67 -3.76 -6.97 2.35 1 1485 2251
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
All Rhat values are below 1.01 and all ESS values exceed 400, indicating a much better performance of the non-centered parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates for tau.
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 20)
Small tau values are still more difficult to explore, but the relative efficiency is good. We may also check this with a finer resolution:
plot_local_ess(fit = fit_ncp2, par = 2, nalpha = 50)
The sampling efficiency for different quantile estimates looks good as well.
plot_quantile_ess(fit = fit_ncp2, par = 2, nalpha = 40)
The rank plots of tau show no substantial differences between chains.
samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])
Betancourt, M. (2017) ‘A conceptual introduction to Hamiltonian Monte Carlo’, arXiv preprint arXiv:1701.02434.
Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.
Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.
The following abbreviations are used throughout the appendices:
Rhat: classic \(\widehat{R}\) (computed without splitting the chains where noted)
zsRhat: rank-normalized split-\(\widehat{R}\)
zfsRhat: folded rank-normalized split-\(\widehat{R}\)
zsseff: bulk-ESS (ESS of the rank-normalized split chains)
zsreff: relative bulk-ESS, i.e., zsseff divided by the total number of draws
tailseff: tail-ESS
There are no examples or numerical experiments related to Appendix A in the paper.
This part focuses on diagnostics for trends in the chains, for a chain with a shifted location, and for a chain with a different scale.
To simplify, in this part, independent draws are used as a proxy for draws from a very efficient MCMC sampler. First, we sample draws from a standard normal distribution; we will discuss the behavior for non-normal distributions later. See Appendix A for the abbreviations used.
All chains are drawn from the same Normal(0, 1) distribution, with the same linear trend added to each chain:
conds <- expand.grid(
iters = c(250, 1000, 4000),
trend = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
trend <- conds[i, "trend"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r <- r + seq(-trend, trend, length.out = iters)
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)
If we don’t split chains, Rhat misses the trends if all chains still have a similar marginal distribution.
ggplot(data = res, aes(y = Rhat, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Rhat without splitting chains')
Split-Rhat can detect trends, even if the marginals of the chains are similar.
ggplot(data = res, aes(y = zsRhat, x = trend)) +
geom_point() + geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is useful for detecting non-stationarity (i.e., trends) in the chains. If we use a threshold of \(1.01\), we can detect trends which account for 2% or more of the total marginal variance. If we use a threshold of \(1.1\), we detect trends which account for 30% or more of the total marginal variance.
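As a rough sanity check of these thresholds (a side calculation assuming, as in the simulation above, unit noise variance), we can compute the fraction of the total marginal variance contributed by a linear trend running from -trend to trend:
trend_var <- function(trend) var(seq(-trend, trend, length.out = 1000))
# Fraction of total variance due to the trend, for the trends used above
sapply(c(0.25, 0.5, 0.75, 1),
       function(t) trend_var(t) / (trend_var(t) + 1))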
The effective sample size is based on split-Rhat and within-chain autocorrelation. We plot the relative efficiency \(R_{\rm eff}=S_{\rm eff}/S\) for easier comparison between different values of \(S\). In the plot below, dashed lines indicate the threshold at which we would consider the effective sample size to be sufficient (i.e., \(S_{\rm eff} > 400\)). Since we plot the relative efficiency instead of the effective sample size itself, this threshold is divided by \(S\), which we compute here as the number of iterations per chain (variable iters) times the number of chains (\(4\)).
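For the three chain lengths used in these simulations, the dashed thresholds on the relative-efficiency scale are:
400 / (4 * c(250, 1000, 4000))  # thresholds for iters = 250, 1000, 4000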
ggplot(data = res, aes(y = zsreff, x = trend)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: Split-Rhat is more sensitive to trends for small sample sizes, but the effective sample size becomes more sensitive for larger sample sizes (as autocorrelations can be estimated more accurately).
Advice: If in doubt, run longer chains for more accurate convergence diagnostics.
Next, we investigate the sensitivity of the diagnostics for detecting that one of the chains has not converged to the same distribution as the others, but has a different mean.
conds <- expand.grid(
iters = c(250, 1000, 4000),
shift = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
shift <- conds[i, "shift"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: If we use a threshold of \(1.01\), we can detect shifts with a magnitude of one third or more of the marginal standard deviation. If we use a threshold of \(1.1\), we detect shifts with a magnitude equal to or larger than the marginal standard deviation.
ggplot(data = res, aes(y = zsreff, x = shift)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: The effective sample size is not as sensitive, but a shift with a magnitude of half the marginal standard deviation or more will lead to very low relative efficiency when the total number of draws increases.
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and a shift of 0.5.
iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Although Rhat was less than \(1.05\) in this situation, the rank plots clearly show that the first chain behaves differently.
Next, we investigate the sensitivity of the diagnostics for detecting that one of the chains has not converged to the same distribution as the others, but has a lower marginal variance.
conds <- expand.grid(
iters = c(250, 1000, 4000),
scale = c(0, 0.25, 0.5, 0.75, 1),
rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
iters <- conds[i, "iters"]
scale <- conds[i, "scale"]
rep <- conds[i, "rep"]
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
rs <- as.data.frame(monitor_extra(r))
res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)
We first look at the Rhat estimates:
ggplot(data = res, aes(y = zsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Split-Rhat')
Result: Split-Rhat is not able to detect scale differences between chains.
ggplot(data = res, aes(y = zfsRhat, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = 1.005, linetype = 'dashed') +
geom_hline(yintercept = 1) +
ggtitle('Folded-split-Rhat')
Result: Folded-Split-Rhat focuses on scales and detects scale differences.
Result: If we use a threshold of \(1.01\), we can detect a chain with scale less than \(3/4\) of the standard deviation of the others. If we use threshold of \(1.1\), we detect a chain with standard deviation less than \(1/4\) of the standard deviation of the others.
Next, we look at the effective sample size estimates:
ggplot(data = res, aes(y = zsreff, x = scale)) +
geom_point() +
geom_jitter() +
facet_grid(. ~ iters) +
geom_hline(yintercept = c(0,1)) +
geom_hline(aes(yintercept = 400 / (4 * iters)), linetype = 'dashed') +
ggtitle('Relative Bulk-ESS (zsreff)') +
scale_y_continuous(breaks = seq(0, 1.5, by = 0.25))
Result: The bulk effective sample size of the mean does not indicate a problem, as it focuses on location differences between chains.
Rank plots can be used to visualize differences between chains. Here, we show rank plots for the case of 4 chains, 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.
iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)
Although folded Rhat is \(1.06\), the rank plots clearly show that the first chain behaves differently.
The classic split-Rhat is based on calculating within- and between-chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-Rhat is not well justified. In this section, we use the Cauchy distribution as an example of such a distribution. Even in cases where the mean and variance are finite, the distribution can be far from Gaussian; especially distributions with very long tails cause instability in variance and autocorrelation estimates. To alleviate these problems, we use split-Rhat on rank-normalized draws.
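For reference, here is a minimal sketch of the rank-normalization step (the actual implementation is in monitornew.R; the offsets follow the formula in the paper): pool the draws, replace each by the normal quantile of its fractional rank, and then apply the classic split-Rhat machinery to the transformed draws.
rank_normalize <- function(sims) {  # sims: iterations x chains array
  r <- rank(as.vector(sims), ties.method = "average")
  z <- qnorm((r - 3/8) / (length(r) + 1/4))  # normal quantiles of fractional ranks
  array(z, dim = dim(sims))
}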
The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution.
We already looked at the nominal Cauchy model with direct parameterization in the main text, but for completeness, we take a closer look using different variants of the diagnostics.
writeLines(readLines("cauchy_nom.stan"))
parameters {
vector[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
Run the nominal model:
fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1421 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Treedepth exceedance and the Bayesian fraction of missing information are dynamic-HMC-specific diagnostics (Betancourt, 2017). We get warnings about a very large number of post-warmup transitions that exceeded the maximum treedepth, which is likely due to the very long tails of the Cauchy distribution. All chains have a low estimated Bayesian fraction of missing information, also indicating slow mixing.
Trace plots for the first parameter look wild with occasional large values:
samp <- as.array(fit_nom)
mcmc_trace(samp[, , 1])
Let’s check Rhat and ESS diagnostics.
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
For one parameter, Rhats exceed the classic threshold of 1.1. Depending on the Rhat variant, a few others also exceed the threshold of 1.01. The rank-normalized split-Rhat has several values over 1.01. Note that the classic split-Rhat is not well defined in this example, because the mean and variance of the Cauchy distribution are not finite.
plot_ess(res)
Both classic and new effective sample size estimates have several very small values, and so the overall sample shouldn’t be trusted.
Result: Effective sample size is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.
We also check the log posterior value lp__ and find out that the effective sample size is worryingly low.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS = ', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 249
cat('lp__: Tail-ESS = ', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 449
We can further analyze potential problems using local effective sample size and rank plots. We examine x[5], which has the smallest tail-ESS of 44.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates (see Section Efficiency for small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures the mixing of an indicator function that indicates whether the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.
plot_local_ess(fit = fit_nom, par = which_min_ess, nalpha = 20)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
An alternative way to examine the effective sample size in different parts of the posterior is to compute the effective sample size for quantiles (see Section Efficiency for quantiles). Each interval has a specified proportion of draws, and the efficiency measures the mixing of an indicator function that indicates whether the values are smaller than or equal to the corresponding quantile.
plot_quantile_ess(fit = fit_nom, par = which_min_ess, nalpha = 40)
We see that the efficiency is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show draws that exceeded the maximum treedepth.
We can further analyze potential problems using rank plots, from which we clearly see differences between chains.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
We can try to improve the performance by increasing max_treedepth to \(20\):
fit_nom_td20 <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20)
)
Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
We check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in the scale than in the location between different chains.
plot_ess(res)
Some classic effective sample sizes are very small. If we didn't realize that the variance is infinite, we might try to run longer chains, but with an infinite variance, zero relative efficiency (ESS/S) is the truth and longer chains won't help with that. However, other quantities can be well defined, which is why it is useful to also look at the rank-normalized version as a generic transformation that achieves finite mean and variance. The smallest bulk-ESS is less than 1000, which is not that bad. The smallest median-ESS is larger than 2500, that is, we are able to estimate the median efficiently. However, many tail-ESS values are less than 400, indicating problems in estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size, which is not well defined for the Cauchy distribution.
Result: It is useful to look at both bulk- and tail-ESS.
We also check lp__. Although increasing max_treedepth improved the bulk-ESS of x, the efficiency for lp__ didn't change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 155
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 476
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 20)
It seems that increasing max_treedepth has not improved the efficiency in the tails much. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_nom_td20, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[9], which has the smallest tail-ESS among the \(x\), indicates clear convergence problems.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__, which has a bulk-ESS of 155, doesn’t look so good either.
mcmc_hist_r_scale(samp[, , "lp__"])
Let’s try running 8 times longer chains.
fit_nom_td20l <- stan(
file = 'cauchy_nom.stan', seed = 7878,
refresh = 0, control = list(max_treedepth = 20),
warmup = 1000, iter = 9000
)
Warning: There were 2 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems
Trace plots for the first parameter still look wild with occasional large values.
samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
Let’s check the diagnostics for all \(x\).
plot_rhat(res)
All Rhats are below \(1.01\). The classic split-Rhat has more variation than the rank normalized Rhat (note that the former is not well defined in this case).
plot_ess(res)
Most classic ESS estimates are close to zero; running longer chains just made most of them even smaller.
The smallest bulk-ESS values are around 5000, which is not that bad. The smallest median-ESS values are larger than 25000, that is, we are able to estimate the median efficiently. However, the smallest tail-ESS is 823, indicating problems in estimating the scale of the posterior.
Result: The rank normalized effective sample size is more stable than classic effective sample size even for very long chains.
Result: It is useful to look at both bulk- and tail-ESS.
We also check lp__. Although increasing the number of iterations improved the bulk-ESS of the \(x\), the relative efficiency for lp__ didn't change.
res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1285
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2100
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 20)
Increasing the chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use a finer resolution for the local efficiency estimates.
plot_local_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
The sampling efficiency far in the tails is worryingly low. This was more difficult to see previously with fewer draws from the tails. We see similar problems in the plot of effective sample size for quantiles.
plot_quantile_ess(fit = fit_nom_td20l, par = which_min_ess, nalpha = 100)
Let’s look at the rank plot visualisation of x[48], which has the smallest tail-ESS among the \(x\).
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Increasing the number of iterations couldn’t remove the mixing problems in the tails. The mixing problem is inherent to the nominal parameterization of the Cauchy distribution.
Next, we examine an alternative parameterization and consider the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters, and the Cauchy-distributed \(x\) can be computed deterministically from those. In addition to improved sampling performance, the example illustrates that diagnostics should be focused on the actual quantities of interest (here \(x\)), not only on the sampled parameters.
writeLines(readLines("cauchy_alt_1.stan"))
parameters {
vector[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a ./ sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the alternative model:
fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)
There are no warnings and the sampling is much faster.
samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank-normalized ESS estimates have less variation than the classic ones, which are not well defined for the Cauchy distribution.
We check lp__:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1517
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2694
The relative efficiencies for lp__ are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 20)
The effective sample size is good in all parts of the posterior. We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_alt1, par = 100 + which_min_ess, nalpha = 40)
We compare the mean relative efficiencies of the underlying parameters in the new parameterization and the actual \(x\) we are interested in.
res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean Bulk-ESS for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x = 3782.24
cat('Mean Tail-ESS for x =' , round(mean(res[, 'tailseff']), 2), '\n')
Mean Tail-ESS for x = 2309.64
cat('Mean Bulk-ESS for x_a =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_a = 4043.48
cat('Mean Bulk-ESS for x_b =' , round(mean(res2[, 'zsseff']), 2), '\n')
Mean Bulk-ESS for x_b = 2638.64
Result: We see that the effective sample size of the interesting \(x\) can be different from the effective sample size of the parameters \(x_a\) and \(x_b\) that we used to compute it.
The rank plot visualisation of x[2], which has the smallest tail-ESS of 1728 among the \(x\), looks better than for the nominal parameterization.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
Similarly, the rank plot visualisation of lp__, which has a relative efficiency of 0.38, looks better than for the nominal parameterization.
mcmc_hist_r_scale(samp[, , "lp__"])
Another alternative parameterization is obtained by a univariate transformation as shown in the following code (see also the 3rd alternative in Michael Betancourt’s case study).
writeLines(readLines("cauchy_alt_3.stan"))
parameters {
vector<lower=0, upper=1>[50] x_tilde;
}
transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}
model {
// Implicit uniform prior on x_tilde
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the alternative model:
fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)
There are no warnings, and the sampling is much faster than for the nominal model.
samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats except some folded Rhats are below 1.01. Classic split-Rhats also look good even though they are not well defined for the Cauchy distribution.
plot_ess(res)
Result: Rank-normalized relative efficiencies have less variation than the classic ones. Some bulk and median relative efficiencies are slightly larger than 1, which is possible for antithetic Markov chains, which have negative autocorrelations at odd lags.
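To see that relative efficiencies above 1 are indeed possible, here is a small side simulation (not part of the Stan fits): AR(1) chains with a negative autoregressive coefficient are antithetic, and their relative bulk-ESS is typically above 1.
set.seed(1)
r <- array(0, c(1000, 4))
for (j in 1:4) {
  r[1, j] <- rnorm(1)
  # AR(1) with negative coefficient: draws anti-correlate at odd lags
  for (t in 2:1000) r[t, j] <- -0.3 * r[t - 1, j] + rnorm(1)
}
as.data.frame(monitor_extra(r))$zsreff  # typically above 1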
We also take a closer look at the lp__ value:
res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 1334
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 2061
The effective sample sizes for these are also much better than with the nominal parameterization.
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 20)
We also examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = fit_alt3, par = 50 + which_min_ess, nalpha = 40)
The effective sample size in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal parameterization.
We compare the mean effective sample size of the underlying parameter in the new parameterization with that of the actual Cauchy-distributed \(x\) we are interested in.
res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-seff for x =' , round(mean(res[, 'zsseff']), 2), '\n')
Mean bulk-seff for x = 4888.2
cat('Mean tail-seff for x =' , round(mean(res[, 'zfsseff']), 2), '\n')
Mean tail-seff for x = 1602.1
cat('Mean bulk-seff for x_tilde =' , round(mean(res1[, 'zsseff']), 2), '\n')
Mean bulk-seff for x_tilde = 4888.2
cat('Mean tail-seff for x_tilde =' , round(mean(res1[, 'zfsseff']), 2), '\n')
Mean tail-seff for x_tilde = 1609.82
The rank plot visualisation of x[23], which has the smallest tail-ESS of 1926 among the \(x\), shows good efficiency, similar to the results for lp__.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
mcmc_hist_r_scale(samp[, , "lp__"])
Half-Cauchy priors are common and, for example, in Stan are usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan automatically does the sampling in the unconstrained log(x) space, which crucially changes the geometry.
writeLines(readLines("half_cauchy_nom.stan"))
parameters {
vector<lower=0>[50] x;
}
model {
x ~ cauchy(0, 1);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the half-Cauchy model with nominal parameterization (and positive constraint).
fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)
There are no warnings and the sampling is much faster than for the full Cauchy distribution with nominal parameterization.
samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
All Rhats are below \(1.01\). Classic split-Rhats also look good even though they are not well defined for the half-Cauchy distribution.
plot_ess(res)
Result: Rank-normalized effective sample sizes have less variation than the classic ones. Some bulk and median relative efficiencies are larger than 1, which is possible for antithetic Markov chains, which have negative autocorrelations at odd lags.
Due to the <lower=0> constraint, the sampling is done in the log(x) space, and we can also check the performance in that space.
res <- monitor_extra(log(samp[, , 1:50]))
plot_ess(res)
\(\log(x)\) is quite close to Gaussian, and thus the classic effective sample size is also close to the rank-normalized ESS, which is exactly the same as for the original \(x\), as rank normalization is invariant to monotone transformations.
Result: The rank normalized effective sample size is close to the classic effective sample size for transformations which make the distribution close to Gaussian.
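As a quick side check of this invariance, note that log() is monotone and hence leaves the ranks of the (positive) draws unchanged:
x1 <- as.vector(samp[, , 1])
identical(rank(x1), rank(log(x1)))  # TRUE: ranks unaffected by log()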
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 20)
The effective sample size is good overall, with only a small dip in the tails. We can also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_half_nom, par = which_min_ess, nalpha = 40)
The rank plot visualisation of x[39], which has the smallest tail-ESS of 1702 among \(x\), looks good.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
The rank plot visualisation of lp__ reveals some small differences in the scales, but it’s difficult to know whether this small variation from uniform is relevant.
mcmc_hist_r_scale(samp[, , "lp__"])
writeLines(readLines("half_cauchy_alt.stan"))
parameters {
vector<lower=0>[50] x_a;
vector<lower=0>[50] x_b;
}
transformed parameters {
vector[50] x = x_a .* sqrt(x_b);
}
model {
x_a ~ normal(0, 1);
x_b ~ inv_gamma(0.5, 0.5);
}
generated quantities {
real I = fabs(x[1]) < 1 ? 1 : 0;
}
We run the half-Cauchy model with the alternative parameterization.
fit_half_reparam <- stan(
file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)
There are no warnings, and the sampling is as fast as for the half-Cauchy model with nominal parameterization.
samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_ess <- which.min(res$tailseff)
plot_rhat(res)
plot_ess(res)
Result: The rank-normalized relative efficiencies have less variation than the classic ones, which are not well defined for the Cauchy distribution. Based on bulk-ESS and median-ESS, the efficiency for central quantities is much lower, but based on tail-ESS and MAD-ESS, the efficiency in the tails is slightly better than for the half-Cauchy distribution with nominal parameterization. We also see that a parameterization which is good for the full Cauchy distribution is not necessarily good for the half-Cauchy distribution, as the <lower=0> constraint additionally changes the parameterization.
We also check the lp__ values:
res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-ESS =', round(res['lp__', 'zsseff'], 2), '\n')
lp__: Bulk-ESS = 734
cat('lp__: Tail-ESS =', round(res['lp__', 'tailseff'], 2), '\n')
lp__: Tail-ESS = 1513
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small interval probability estimates.
plot_local_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 20)
We also examine the effective sample size for different quantile estimates.
plot_quantile_ess(fit_half_reparam, par = 100 + which_min_ess, nalpha = 40)
The effective sample size near zero is much worse than for the half-Cauchy distribution with nominal parameterization.
The rank plot visualisation of x[13], which has the smallest tail-ESS (reported as NaN) among the \(x\), reveals deviations from uniformity, which is expected given the lower effective sample size.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp[, , xmin])
A similar result is obtained when looking at the rank plots of lp__.
mcmc_hist_r_scale(samp[, , "lp__"])
So far, we have run all models in Stan, but we also want to investigate whether similar problems arise with probabilistic programming languages that use samplers other than variants of Hamiltonian Monte Carlo. Thus, we will fit the Cauchy models also with Jags, which uses a dialect of the BUGS language to specify models. Jags uses a clever mix of Gibbs and Metropolis-Hastings sampling. This kind of sampling does not scale well to high-dimensional posteriors of strongly interdependent parameters, but for the relatively simple models discussed in this case study it should work just fine.
The Jags code for the nominal parameterization of the Cauchy distribution looks as follows:
writeLines(readLines("cauchy_nom.bugs"))
model {
for (i in 1:50) {
x[i] ~ dt(0, 1, 1)
}
}
First, we initialize the Jags model for later reuse.
jags_nom <- jags.model(
"cauchy_nom.bugs",
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 0
Unobserved stochastic nodes: 50
Total graph size: 52
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_nom <- coda.samples(
jags_nom, variable.names = "x",
n.iter = 1000
)
# Convert the coda mcmc.list into an iterations x chains x parameters
# array, as expected by monitor()
samp_jags_nom <- aperm(abind(samp_jags_nom, along = 3), c(1, 3, 2))
dimnames(samp_jags_nom)[[2]] <- paste0("chain:", 1:4)
We summarize the model as follows:
mon <- monitor(samp_jags_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -6.98 0.00 5.78 -0.22 96.29 1 3905 4015
x[2] -5.80 -0.01 6.38 -0.20 20.37 1 3934 3566
x[3] -6.57 -0.02 5.97 0.85 47.42 1 3959 3931
x[4] -6.17 0.00 6.40 24.56 1573.65 1 3984 3931
x[5] -6.05 0.01 6.36 0.05 20.26 1 3719 3861
x[6] -6.16 0.03 6.49 -0.52 58.84 1 3783 3882
x[7] -5.85 -0.04 6.36 0.44 77.65 1 4047 3828
x[8] -6.21 0.02 5.82 0.58 41.36 1 3675 3794
x[9] -6.11 -0.01 6.56 1.41 125.51 1 3956 3890
x[10] -6.08 0.01 5.96 -0.71 57.03 1 4028 3657
x[11] -5.83 0.05 6.56 -0.58 25.10 1 4119 3831
x[12] -5.93 0.01 6.61 0.39 112.68 1 4128 3519
x[13] -6.56 -0.03 6.48 3.21 146.41 1 4085 3735
x[14] -6.85 0.01 5.72 2.89 221.05 1 4170 3892
x[15] -6.23 0.04 7.33 13.20 957.25 1 3810 3929
x[16] -5.77 0.02 6.77 1.74 61.86 1 3836 3737
x[17] -5.91 -0.02 6.31 -0.92 61.47 1 3881 3968
x[18] -6.61 0.02 6.76 -1.54 150.23 1 3820 3838
x[19] -7.07 0.02 7.19 -3.71 316.37 1 3856 3972
x[20] -6.16 0.03 6.78 2.77 122.62 1 4029 3930
x[21] -6.43 -0.05 5.77 -9.08 545.35 1 4079 3723
x[22] -6.10 0.01 6.26 -3.28 187.40 1 3948 4057
x[23] -6.76 0.03 6.42 2.18 70.38 1 3648 3673
x[24] -6.24 -0.02 7.50 3.41 175.73 1 4081 4190
x[25] -6.30 -0.02 6.26 0.70 78.88 1 3843 3813
x[26] -6.24 0.03 5.95 4.69 326.15 1 3977 3893
x[27] -6.35 -0.02 5.76 1.10 40.67 1 3909 3544
x[28] -6.42 -0.01 6.15 -5.02 323.93 1 4139 3811
x[29] -6.75 0.01 6.25 -0.55 28.34 1 3918 3419
x[30] -6.35 -0.01 5.93 -0.40 25.65 1 3780 3696
x[31] -5.93 0.01 6.47 4.14 263.07 1 4027 3748
x[32] -6.33 0.01 6.96 0.21 31.84 1 3715 3754
x[33] -5.77 -0.04 5.94 -0.27 54.68 1 4065 3725
x[34] -5.35 0.02 6.72 -4.46 165.38 1 4016 3994
x[35] -6.57 0.00 6.39 1.63 137.53 1 4006 3929
x[36] -7.19 -0.01 5.74 9.99 544.05 1 3710 3901
x[37] -6.70 -0.01 6.99 -1.55 89.90 1 3962 3741
x[38] -7.04 0.02 5.98 -1.69 71.28 1 3948 3734
x[39] -6.84 0.00 6.06 4.07 199.17 1 3882 3890
x[40] -6.63 -0.02 5.82 10.41 589.01 1 3902 3737
x[41] -6.14 0.05 6.26 1.21 57.24 1 3528 3510
x[42] -6.62 -0.01 6.28 0.37 32.40 1 3183 3576
x[43] -5.82 0.00 6.83 0.53 101.33 1 3990 3900
x[44] -6.80 -0.01 6.59 -28.81 1395.84 1 3999 3773
x[45] -6.36 0.01 6.32 0.53 118.09 1 3918 3884
x[46] -6.68 0.00 6.36 -0.30 21.96 1 3933 3950
x[47] -7.06 0.02 6.35 -0.67 82.81 1 3790 3642
x[48] -6.23 0.00 6.56 -0.31 27.62 1 4088 3599
x[49] -6.75 0.00 6.15 -3.89 175.81 1 4186 3831
x[50] -5.95 0.02 5.97 -0.39 78.22 1 3898 3910
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_ess <- which.min(mon[1:50, 'Bulk_ESS'])
The overall results look very promising with Rhats = 1 and ESS values close to the total number of draws of 4000. We take a detailed look at x[42], which has the smallest bulk-ESS of 3183.
We examine the sampling efficiency in different parts of the posterior by computing efficiency estimates for small interval probabilities.
plot_local_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 20)
The efficiency estimate is good in all parts of the posterior. Further, we examine the sampling efficiency of different quantile estimates.
plot_quantile_ess(fit = samp_jags_nom, par = which_min_ess, nalpha = 40)
Rank plots also look rather similar across chains.
xmin <- paste0("x[", which_min_ess, "]")
mcmc_hist_r_scale(samp_jags_nom[, , xmin])
Result: Jags seems to be able to sample from the nominal parameterization of the Cauchy distribution just fine.
We continue our discussion of hierarchical models on the eight schools data, which we started in the Eight Schools section. We also analyse the performance of different variants of the diagnostics.
writeLines(readLines("eight_schools_cp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta[J];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta ~ normal(mu, tau);
y ~ normal(theta, sigma);
}
In the main text, we observed that the centered parameterization of this hierarchical model did not work well with Stan's default MCMC options, even with increased adapt_delta, so we directly try to fit the model with longer chains.
Low efficiency can sometimes be compensated for by running longer chains. Let's check 10 times longer chains.
fit_cp2 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 20000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 736 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.99 4.43 9.73 4.39 3.29 1.00 4792 10361
tau 0.49 2.95 10.04 3.81 3.22 1.01 826 362
theta[1] -1.43 5.69 16.40 6.31 5.68 1.00 6726 12088
theta[2] -2.43 4.85 12.71 4.93 4.72 1.00 8847 16187
theta[3] -5.02 4.15 11.97 3.90 5.37 1.00 8240 13852
theta[4] -2.72 4.75 12.65 4.79 4.82 1.00 7954 15676
theta[5] -4.45 3.83 10.65 3.56 4.68 1.00 7130 14564
theta[6] -4.17 4.20 11.73 4.04 4.92 1.00 8505 14918
theta[7] -0.81 5.92 15.58 6.44 5.15 1.00 5886 13511
theta[8] -3.40 4.80 13.57 4.86 5.43 1.00 8342 14089
lp__ -25.06 -15.17 -2.25 -14.64 6.82 1.01 844 486
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp2)
print(res)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.39 0.05 3.29 -0.99 4.43 9.73 4752 0.12 4798 4747 4792 0.12 1 1.00
tau 3.81 0.07 3.22 0.49 2.95 10.04 2034 0.05 2044 809 826 0.02 1 1.00
theta[1] 6.31 0.07 5.68 -1.43 5.69 16.40 7272 0.18 7281 6701 6726 0.17 1 1.00
theta[2] 4.93 0.05 4.72 -2.43 4.85 12.71 9695 0.24 9780 8530 8847 0.22 1 1.00
theta[3] 3.90 0.05 5.37 -5.02 4.15 11.97 9998 0.25 10062 8103 8240 0.21 1 1.00
theta[4] 4.79 0.05 4.82 -2.72 4.75 12.65 9066 0.23 9183 7983 7954 0.20 1 1.00
theta[5] 3.56 0.05 4.68 -4.45 3.83 10.65 8240 0.21 8209 7111 7130 0.18 1 1.00
theta[6] 4.04 0.05 4.92 -4.17 4.20 11.73 9595 0.24 9566 8410 8505 0.21 1 1.00
theta[7] 6.44 0.07 5.15 -0.81 5.92 15.58 6260 0.16 6337 5796 5886 0.15 1 1.00
theta[8] 4.86 0.05 5.43 -3.40 4.80 13.57 10213 0.26 10300 8295 8342 0.21 1 1.00
lp__ -14.64 0.25 6.82 -25.06 -15.17 -2.25 764 0.02 781 823 844 0.02 1 1.01
zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu 1 1.00 1 8921 0.22 10361 0.26 4529 0.11 7922 0.20
tau 1 1.01 1 7088 0.18 362 0.01 2254 0.06 6200 0.16
theta[1] 1 1.00 1 4139 0.10 12088 0.30 5826 0.15 7803 0.20
theta[2] 1 1.00 1 7238 0.18 16187 0.40 5989 0.15 7661 0.19
theta[3] 1 1.00 1 8034 0.20 13852 0.35 4821 0.12 7516 0.19
theta[4] 1 1.00 1 7705 0.19 15676 0.39 4882 0.12 7965 0.20
theta[5] 1 1.00 1 9193 0.23 14564 0.36 4277 0.11 8038 0.20
theta[6] 1 1.00 1 9230 0.23 14918 0.37 5113 0.13 7868 0.20
theta[7] 1 1.00 1 6516 0.16 13511 0.34 5453 0.14 7843 0.20
theta[8] 1 1.00 1 6585 0.16 14089 0.35 5662 0.14 7462 0.19
lp__ 1 1.01 1 1477 0.04 486 0.01 2082 0.05 4917 0.12
We still get a large number of divergent transitions, so it's clear that the results can't be trusted even if all other diagnostics were good. Still, it may be worth looking at additional diagnostics to better understand what's happening.
Some rank-normalized split-Rhats are still larger than \(1.01\). The Bulk-ESS for tau and lp__ is around 800, which corresponds to a low relative efficiency of \(1\%\), but is still above our recommendation of ESS > 400. In cases like this, it is useful to also look at the local efficiency estimates (and the large number of divergences is a clear indication of problems, too).
We examine the sampling efficiency in different parts of the posterior by computing the effective sample size for small intervals for tau.
plot_local_ess(fit = fit_cp2, par = "tau", nalpha = 50)
We see that the sampling has difficulties in exploring small tau values. As ESS < 400 for small probability intervals at small tau values, we may suspect that we miss a substantial amount of posterior mass and get biased estimates.
We also examine the effective sample size of different quantile estimates.
plot_quantile_ess(fit = fit_cp2, par = "tau", nalpha = 100)
Several quantile estimates have ESS < 400, which raises the suspicion that there are convergence problems and that we may have significant bias.
Let's see how the Bulk-ESS and Tail-ESS change when we use more and more draws.
plot_change_ess(fit = fit_cp2, par = "tau")
We see that the recommended thresholds of Bulk-ESS > 400 and Tail-ESS > 400 are not sufficient to detect the convergence problems in this case, even though the ESS estimates for tail quantiles are able to detect them.
The rank plot visualisation of tau also shows clear sticking and mixing problems.
samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])
Similar results are obtained for lp__, which is closely connected to tau for this model.
mcmc_hist_r_scale(samp_cp2[, , "lp__"])
We may also examine small interval efficiencies for mu.
plot_local_ess(fit = fit_cp2, par = "mu", nalpha = 50)
There are gaps of poor efficiency, which again indicate problems in the mixing of the chains. However, these problems do not occur for any specific range of values of mu, as was the case for tau. This tells us that the sampler probably does not have problems with mu itself, but more likely with tau or a related quantity.
As we observed divergences, we shouldn't trust any Monte Carlo standard error (MCSE) estimates, as they are likely biased as well. However, for illustration purposes, we compute the MCSE, tail quantiles, and corresponding effective sample sizes for the medians of mu and tau. Compared to the shorter MCMC run, using 10 times as many draws has not reduced the MCSE to one third, as would be expected if the chains mixed well.
round(quantile_mcse(samp_cp2[ , , "mu"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.06 4.32 4.53 4528.59
round(quantile_mcse(samp_cp2[ , , "tau"], prob = 0.5), 2)
mcse Q05 Q95 Seff
1 0.07 2.83 3.07 2254.07
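For reference, with well-mixing chains, 10 times as many draws should shrink the MCSE by a factor of \(\sqrt{10}\):
# Expected MCSE shrinkage from 10 times the draws under good mixing:
1 / sqrt(10)  # ~ 0.32, i.e., roughly "to one third"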
For further evidence, let's check chains 100 times longer than the default. This is not something we would recommend doing in practice, as it does not solve the problems with divergences, as illustrated below.
fit_cp3 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 200000, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 9955 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
monitor(fit_cp3)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.11 4.41 10.11 4.45 3.44 1 3893 1359
tau 0.44 2.91 10.03 3.77 3.21 1 1863 419
theta[1] -1.57 5.78 16.33 6.36 5.70 1 10743 92177
theta[2] -2.54 4.90 13.21 5.02 4.81 1 8233 1505
theta[3] -5.00 4.16 12.35 3.97 5.42 1 7541 1484
theta[4] -2.99 4.73 13.08 4.81 4.92 1 8938 1496
theta[5] -4.55 3.83 11.14 3.63 4.81 1 6024 1459
theta[6] -4.19 4.19 12.00 4.07 4.98 1 6927 1460
theta[7] -1.04 5.96 15.62 6.45 5.20 1 9393 74195
theta[8] -3.47 4.80 13.58 4.92 5.44 1 10756 78151
lp__ -24.94 -15.07 -2.28 -14.54 6.82 1 5146 2748
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp3)
print(res)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.45 0.06 3.44 -1.11 4.41 10.11 3376 0.01 3341 3934 3893 0.01 1 1
tau 3.77 0.03 3.21 0.44 2.91 10.03 11675 0.03 11652 1934 1863 0.00 1 1
theta[1] 6.36 0.05 5.70 -1.57 5.78 16.33 13093 0.03 12903 10874 10743 0.03 1 1
theta[2] 5.02 0.05 4.81 -2.54 4.90 13.21 8225 0.02 8117 8347 8233 0.02 1 1
theta[3] 3.97 0.06 5.42 -5.00 4.16 12.35 8445 0.02 8318 7627 7541 0.02 1 1
theta[4] 4.81 0.05 4.92 -2.99 4.73 13.08 9207 0.02 9141 9000 8938 0.02 1 1
theta[5] 3.63 0.06 4.81 -4.55 3.83 11.14 6505 0.02 6443 6086 6024 0.02 1 1
theta[6] 4.07 0.06 4.98 -4.19 4.19 12.00 7329 0.02 7238 7018 6927 0.02 1 1
theta[7] 6.45 0.05 5.20 -1.04 5.96 15.62 9861 0.02 9769 9479 9393 0.02 1 1
theta[8] 4.92 0.05 5.44 -3.47 4.80 13.58 12287 0.03 12205 10838 10756 0.03 1 1
lp__ -14.54 0.10 6.82 -24.94 -15.07 -2.28 4936 0.01 4929 5153 5146 0.01 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu 1 1 1 3154 0.01 1359 0.00 18360 0.05 14584 0.04
tau 1 1 1 36641 0.09 419 0.00 14003 0.04 15028 0.04
theta[1] 1 1 1 17072 0.04 92177 0.23 17730 0.04 15418 0.04
theta[2] 1 1 1 10355 0.03 1505 0.00 17620 0.04 14264 0.04
theta[3] 1 1 1 11243 0.03 1484 0.00 18993 0.05 13741 0.03
theta[4] 1 1 1 11852 0.03 1496 0.00 18121 0.05 14711 0.04
theta[5] 1 1 1 8373 0.02 1459 0.00 18764 0.05 13977 0.03
theta[6] 1 1 1 9961 0.02 1460 0.00 17788 0.04 13662 0.03
theta[7] 1 1 1 12276 0.03 74195 0.19 17502 0.04 16218 0.04
theta[8] 1 1 1 16739 0.04 78151 0.20 18122 0.05 15442 0.04
lp__ 1 1 1 6866 0.02 2748 0.01 13296 0.03 16026 0.04
Rhat, Bulk-ESS, and Tail-ESS are not able to detect the problems, although the Tail-ESS for tau is suspiciously low compared to the total number of draws.
plot_local_ess(fit = fit_cp3, par = "tau", nalpha = 100)
plot_quantile_ess(fit = fit_cp3, par = "tau", nalpha = 100)
And the rank plots of tau also show sticking and mixing problems for small values of tau.
samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , "tau"])
What we do see is an advantage of rank plots over trace plots: even with 100000 draws per chain, rank plots don't get crowded, and the mixing problems of the chains are still easy to see.
With the centered parameterization, the mean estimate of tau tends to get smaller as more draws are used. With 400000 draws using the centered parameterization the mean estimate is 3.77 (se 0.03). With 40000 draws using the non-centered parameterization the mean estimate is 3.6 (se 0.02). The difference is more than 8 sigmas. We are able to see the convergence problems in the centered parameterization if we look carefully (or use the divergence diagnostic), but Rhat, Bulk-ESS, Tail-ESS, and Monte Carlo error estimates for the mean can't be trusted when other diagnostics indicate convergence problems!
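One way to arrive at the quoted 8-sigma figure is to measure the difference in units of the non-centered run's standard error (this choice of denominator is our assumption):
# Difference between the two mean estimates in units of the smaller se:
(3.77 - 3.6) / 0.02  # = 8.5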
When the autocorrelation time is high, it has been common to thin the chains by saving only a small portion of the draws. This throws away information that is also useful for convergence diagnostics. With 400000 iterations per chain, thinning by 200, and 4 chains, we again end up with 4000 draws in total, as with the default settings.
fit_cp4 <- stan(
file = 'eight_schools_cp.stan', data = eight_schools,
iter = 400000, thin = 200, chains = 4, seed = 483892929, refresh = 0,
control = list(adapt_delta = 0.95)
)
Warning: There were 74 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
We observe several divergent transitions, and the estimated Bayesian fraction of missing information is also low, which indicates convergence problems and potentially biased estimates.
Unfortunately, the thinning causes the Rhat and ESS estimates to miss the problems. The posterior mean is still biased, being more than 3 sigmas away from the estimate obtained using the non-centered parameterization.
monitor(fit_cp4)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.15 4.29 9.84 4.30 3.33 1 4086 3773
tau 0.43 2.93 9.75 3.70 3.12 1 2576 1712
theta[1] -1.74 5.43 16.44 6.04 5.55 1 4091 3888
theta[2] -2.25 4.73 13.00 4.97 4.76 1 4064 3647
theta[3] -4.96 4.10 12.11 3.85 5.34 1 4286 3973
theta[4] -3.12 4.58 12.54 4.68 4.80 1 4055 3880
theta[5] -4.51 3.68 10.70 3.47 4.72 1 4063 3524
theta[6] -4.19 4.15 11.75 3.95 4.94 1 3969 3787
theta[7] -1.22 5.75 15.44 6.23 5.08 1 3882 3533
theta[8] -3.91 4.68 13.54 4.80 5.40 1 3973 3813
lp__ -24.91 -15.15 -1.17 -14.29 7.11 1 2672 1759
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_cp4)
print(res)
Inference for the input samples (4 chains: each with iter = 4e+05; warmup = 2e+05):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
mu 4.30 0.05 3.33 -1.15 4.29 9.84 4075 1.02 4085 4077 4086 1.02 1 1
tau 3.70 0.05 3.12 0.43 2.93 9.75 3339 0.83 3395 2577 2576 0.64 1 1
theta[1] 6.04 0.09 5.55 -1.74 5.43 16.44 4027 1.01 4036 4081 4091 1.02 1 1
theta[2] 4.97 0.07 4.76 -2.25 4.73 13.00 4046 1.01 4055 4055 4064 1.02 1 1
theta[3] 3.85 0.08 5.34 -4.96 4.10 12.11 4329 1.08 4340 4275 4286 1.07 1 1
theta[4] 4.68 0.08 4.80 -3.12 4.58 12.54 4013 1.00 4047 4020 4055 1.01 1 1
theta[5] 3.47 0.07 4.72 -4.51 3.68 10.70 3989 1.00 4020 4003 4063 1.02 1 1
theta[6] 3.95 0.08 4.94 -4.19 4.15 11.75 3727 0.93 3802 3930 3969 0.99 1 1
theta[7] 6.23 0.08 5.08 -1.22 5.75 15.44 3891 0.97 3899 3874 3882 0.97 1 1
theta[8] 4.80 0.09 5.40 -3.91 4.68 13.54 3987 1.00 4019 3942 3973 0.99 1 1
lp__ -14.29 0.14 7.11 -24.91 -15.15 -1.17 2570 0.64 2570 2668 2672 0.67 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
mu 1 1 1 4029 1.01 3773 0.94 4102 1.03 3761 0.94
tau 1 1 1 3817 0.95 1712 0.43 3754 0.94 4101 1.03
theta[1] 1 1 1 3693 0.92 3888 0.97 4130 1.03 3987 1.00
theta[2] 1 1 1 3937 0.98 3647 0.91 4084 1.02 3582 0.90
theta[3] 1 1 1 4033 1.01 3973 0.99 3681 0.92 3852 0.96
theta[4] 1 1 1 3908 0.98 3880 0.97 3816 0.95 3537 0.88
theta[5] 1 1 1 3695 0.92 3524 0.88 3808 0.95 3491 0.87
theta[6] 1 1 1 3725 0.93 3787 0.95 3880 0.97 3731 0.93
theta[7] 1 1 1 3281 0.82 3533 0.88 3978 0.99 3934 0.98
theta[8] 1 1 1 3717 0.93 3813 0.95 3900 0.98 3765 0.94
lp__ 1 1 1 2949 0.74 1759 0.44 3682 0.92 3788 0.95
Various diagnostic plots of tau look reasonable as well.
plot_local_ess(fit = fit_cp4, par = "tau", nalpha = 100)
plot_quantile_ess(fit = fit_cp4, par = "tau", nalpha = 100)
plot_change_ess(fit = fit_cp4, par = "tau")
However, the rank plots still seem to show the problem.
samp_cp4 <- as.array(fit_cp4)
mcmc_hist_r_scale(samp_cp4[, , "tau"])
In the following, we want to expand our understanding of the non-centered parameterization of the hierarchical model fit to the eight schools data.
writeLines(readLines("eight_schools_ncp.stan"))
data {
int<lower=0> J;
real y[J];
real<lower=0> sigma[J];
}
parameters {
real mu;
real<lower=0> tau;
real theta_tilde[J];
}
transformed parameters {
real theta[J];
for (j in 1:J)
theta[j] = mu + tau * theta_tilde[j];
}
model {
mu ~ normal(0, 5);
tau ~ cauchy(0, 5);
theta_tilde ~ normal(0, 1);
y ~ normal(theta, sigma);
}
In the main text, we have already seen that the non-centered parameterization works better than the centered parameterization, at least when we use an increased adapt_delta value. Let's see what happens when using the default MCMC options of Stan.
fit_ncp <- stan(
file = 'eight_schools_ncp.stan', data = eight_schools,
iter = 2000, chains = 4, seed = 483892929, refresh = 0
)
We observe a few divergent transitions with the default of adapt_delta=0.8. Let’s analyze the sample.
monitor(fit_ncp)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.13 4.49 9.74 4.41 3.33 1.00 4746 2365
tau 0.27 2.67 9.66 3.50 3.08 1.00 2563 1951
theta_tilde[1] -1.38 0.33 1.96 0.31 1.01 1.00 5252 2900
theta_tilde[2] -1.47 0.11 1.63 0.09 0.94 1.00 4927 3128
theta_tilde[3] -1.67 -0.09 1.54 -0.07 0.98 1.00 5741 2899
theta_tilde[4] -1.49 0.03 1.59 0.04 0.94 1.00 4846 3075
theta_tilde[5] -1.65 -0.17 1.38 -0.16 0.92 1.01 4355 2935
theta_tilde[6] -1.64 -0.11 1.45 -0.09 0.94 1.00 4063 2533
theta_tilde[7] -1.25 0.35 1.87 0.33 0.96 1.00 4681 2836
theta_tilde[8] -1.50 0.08 1.63 0.08 0.98 1.00 5204 2647
theta[1] -1.54 5.49 15.80 6.12 5.53 1.00 4567 3124
theta[2] -2.69 4.83 12.60 4.89 4.67 1.00 5586 3091
theta[3] -4.50 4.18 11.89 4.00 5.15 1.00 4465 2888
theta[4] -2.82 4.61 12.21 4.69 4.73 1.00 4209 2879
theta[5] -4.03 3.96 10.70 3.70 4.57 1.00 4656 3220
theta[6] -4.12 4.17 11.45 3.99 4.75 1.00 5241 3157
theta[7] -0.83 5.80 15.09 6.26 5.04 1.00 4055 3041
theta[8] -3.15 4.81 13.49 4.87 5.21 1.00 4817 2939
lp__ -11.33 -6.65 -3.77 -6.97 2.33 1.00 1525 2496
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(fit_ncp)
print(res)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat
mu 4.41 0.05 3.33 -1.13 4.49 9.74 4646 1.16 4709 4682 4746 1.19 1
tau 3.50 0.06 3.08 0.27 2.67 9.66 2857 0.71 2880 2529 2563 0.64 1
theta_tilde[1] 0.31 0.01 1.01 -1.38 0.33 1.96 5241 1.31 5257 5236 5252 1.31 1
theta_tilde[2] 0.09 0.01 0.94 -1.47 0.11 1.63 4830 1.21 4872 4883 4927 1.23 1
theta_tilde[3] -0.07 0.01 0.98 -1.67 -0.09 1.54 5687 1.42 5754 5674 5741 1.44 1
theta_tilde[4] 0.04 0.01 0.94 -1.49 0.03 1.59 4794 1.20 4845 4793 4846 1.21 1
theta_tilde[5] -0.16 0.01 0.92 -1.65 -0.17 1.38 4286 1.07 4357 4285 4355 1.09 1
theta_tilde[6] -0.09 0.01 0.94 -1.64 -0.11 1.45 4025 1.01 4058 4031 4063 1.02 1
theta_tilde[7] 0.33 0.01 0.96 -1.25 0.35 1.87 4600 1.15 4685 4596 4681 1.17 1
theta_tilde[8] 0.08 0.01 0.98 -1.50 0.08 1.63 5147 1.29 5194 5163 5204 1.30 1
theta[1] 6.12 0.08 5.53 -1.54 5.49 15.80 4262 1.07 4289 4532 4567 1.14 1
theta[2] 4.89 0.06 4.67 -2.69 4.83 12.60 5297 1.32 5305 5577 5586 1.40 1
theta[3] 4.00 0.08 5.15 -4.50 4.18 11.89 4063 1.02 4141 4375 4465 1.12 1
theta[4] 4.69 0.08 4.73 -2.82 4.61 12.21 3850 0.96 3973 4064 4209 1.05 1
theta[5] 3.70 0.07 4.57 -4.03 3.96 10.70 4371 1.09 4471 4554 4656 1.16 1
theta[6] 3.99 0.07 4.75 -4.12 4.17 11.45 4995 1.25 5000 5235 5241 1.31 1
theta[7] 6.26 0.08 5.04 -0.83 5.80 15.09 3947 0.99 3972 4029 4055 1.01 1
theta[8] 4.87 0.08 5.21 -3.15 4.81 13.49 4537 1.13 4536 4799 4817 1.20 1
lp__ -6.97 0.06 2.33 -11.33 -6.65 -3.77 1425 0.36 1462 1485 1525 0.38 1
sRhat zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff
mu 1 1 1 1.00 2017 0.50 2365 0.59 4922 1.23
tau 1 1 1 1.00 3010 0.75 1951 0.49 3342 0.84
theta_tilde[1] 1 1 1 1.00 1979 0.49 2900 0.72 4884 1.22
theta_tilde[2] 1 1 1 1.00 2048 0.51 3128 0.78 4904 1.23
theta_tilde[3] 1 1 1 1.00 1877 0.47 2899 0.72 5215 1.30
theta_tilde[4] 1 1 1 1.00 2167 0.54 3075 0.77 4697 1.17
theta_tilde[5] 1 1 1 1.01 2074 0.52 2935 0.73 3959 0.99
theta_tilde[6] 1 1 1 1.00 2001 0.50 2533 0.63 4419 1.10
theta_tilde[7] 1 1 1 1.00 1969 0.49 2836 0.71 4776 1.19
theta_tilde[8] 1 1 1 1.00 1813 0.45 2647 0.66 4821 1.21
theta[1] 1 1 1 1.00 2611 0.65 3124 0.78 4468 1.12
theta[2] 1 1 1 1.00 2318 0.58 3091 0.77 5198 1.30
theta[3] 1 1 1 1.00 2486 0.62 2888 0.72 5213 1.30
theta[4] 1 1 1 1.00 2273 0.57 2879 0.72 4908 1.23
theta[5] 1 1 1 1.00 2250 0.56 3220 0.80 5000 1.25
theta[6] 1 1 1 1.00 2243 0.56 3157 0.79 5031 1.26
theta[7] 1 1 1 1.00 2551 0.64 3041 0.76 4630 1.16
theta[8] 1 1 1 1.00 2224 0.56 2939 0.73 4627 1.16
lp__ 1 1 1 1.00 2456 0.61 2496 0.62 2107 0.53
madsseff madsreff
mu 2143 0.54
tau 2936 0.73
theta_tilde[1] 2293 0.57
theta_tilde[2] 2124 0.53
theta_tilde[3] 2229 0.56
theta_tilde[4] 2463 0.62
theta_tilde[5] 2206 0.55
theta_tilde[6] 2532 0.63
theta_tilde[7] 2522 0.63
theta_tilde[8] 2193 0.55
theta[1] 2695 0.67
theta[2] 2641 0.66
theta[3] 2625 0.66
theta[4] 2558 0.64
theta[5] 2596 0.65
theta[6] 2414 0.60
theta[7] 2828 0.71
theta[8] 2587 0.65
lp__ 3014 0.75
All Rhats are close to 1, and ESSs are good despite a few divergent transitions. Small-interval and quantile plots of tau reveal some sampling problems for small tau values, but not nearly as severe as for the centered parameterization.
plot_local_ess(fit = fit_ncp, par = "tau", nalpha = 20)
plot_quantile_ess(fit = fit_ncp, par = "tau", nalpha = 40)
Overall, the non-centered parameterization looks good even for the default settings of adapt_delta, and increasing it to 0.95 gets rid of the last remaining problems. This stands in sharp contrast to what we observed for the centered parameterization, where increasing adapt_delta didn’t help at all. Actually, this is something we observe quite often: A suboptimal parameterization can cause problems that are not simply solved by tuning the sampler. Instead, we have to adjust our model to achieve trustworthy inference.
We will also run the centered and non-centered parameterizations of the eight schools model with Jags.
The Jags code for the centered eight schools model looks as follows:
writeLines(readLines("eight_schools_cp.bugs"))
model {
for (j in 1:J) {
sigma_prec[j] <- pow(sigma[j], -2)
theta[j] ~ dnorm(mu, tau_prec)
y[j] ~ dnorm(theta[j], sigma_prec[j])
}
mu ~ dnorm(0, pow(5, -2))
tau ~ dt(0, pow(5, -2), 1)T(0, )
tau_prec <- pow(tau, -2)
}
First, we initialize the Jags model for later reuse.
jags_cp <- jags.model(
"eight_schools_cp.bugs",
data = eight_schools,
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 8
Unobserved stochastic nodes: 10
Total graph size: 40
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_cp <- coda.samples(
jags_cp, c("theta", "mu", "tau"),
n.iter = 1000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))
Convergence diagnostics indicate problems in the sampling of mu and tau, but also, to a lesser degree, in all other parameters.
mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -0.76 4.48 10.53 4.64 3.47 1.05 132 204
tau 0.28 2.96 10.22 3.94 3.89 1.08 64 92
theta[1] -1.24 6.16 16.96 6.67 5.91 1.05 156 258
theta[2] -2.73 5.10 13.50 5.19 4.96 1.03 242 409
theta[3] -5.01 4.25 12.60 4.18 5.46 1.03 281 317
theta[4] -2.92 5.02 13.36 5.09 5.02 1.03 243 423
theta[5] -4.90 3.87 11.26 3.66 4.95 1.03 217 295
theta[6] -3.80 4.34 12.67 4.40 5.00 1.03 245 388
theta[7] -0.67 6.24 16.18 6.76 5.29 1.05 130 361
theta[8] -3.42 4.95 14.63 5.15 5.71 1.03 254 551
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
We also see problems in the sampling of tau using various diagnostic plots.
plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_change_ess(samp_jags_cp, par = "tau")
Let’s see what happens if we run 10 times longer chains.
samp_jags_cp <- coda.samples(
jags_cp, c("theta", "mu", "tau"),
n.iter = 10000
)
samp_jags_cp <- aperm(abind(samp_jags_cp, along = 3), c(1, 3, 2))
Convergence looks better now, although tau is still not estimated very efficiently.
mon <- monitor(samp_jags_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.01 4.33 9.89 4.42 3.30 1.00 1166 1819
tau 0.23 2.81 10.10 3.68 3.36 1.01 622 451
theta[1] -1.45 5.67 16.32 6.31 5.68 1.00 1957 3387
theta[2] -2.44 4.81 12.71 4.98 4.70 1.00 2134 5179
theta[3] -4.92 4.05 11.75 3.90 5.32 1.00 2185 6508
theta[4] -2.89 4.66 12.61 4.79 4.84 1.00 2087 7093
theta[5] -4.32 3.76 10.78 3.65 4.68 1.00 1970 5001
theta[6] -4.05 4.13 11.59 4.07 4.87 1.00 2138 7115
theta[7] -0.87 5.85 15.50 6.39 5.10 1.00 1792 2721
theta[8] -3.26 4.71 13.53 4.90 5.33 1.00 2357 6810
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
The diagnostic plots of quantiles and small intervals tell a similar story.
plot_local_ess(samp_jags_cp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_cp, par = "tau", nalpha = 20)
Notably, however, the increase in effective sample size of tau is linear in the total number of draws, indicating that convergence for tau may be achieved by simply running longer chains.
plot_change_ess(samp_jags_cp, par = "tau")
Result: Similar to Stan, Jags also has convergence problems with the centered parameterization of the eight schools model.
The Jags code for the non-centered eight schools model looks as follows:
writeLines(readLines("eight_schools_ncp.bugs"))
model {
for (j in 1:J) {
sigma_prec[j] <- pow(sigma[j], -2)
theta_tilde[j] ~ dnorm(0, 1)
theta[j] = mu + tau * theta_tilde[j]
y[j] ~ dnorm(theta[j], sigma_prec[j])
}
mu ~ dnorm(0, pow(5, -2))
tau ~ dt(0, pow(5, -2), 1)T(0, )
}
First, we initialize the Jags model for later reuse.
jags_ncp <- jags.model(
"eight_schools_ncp.bugs",
data = eight_schools,
n.chains = 4, n.adapt = 10000
)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 8
Unobserved stochastic nodes: 10
Total graph size: 55
Initializing model
Next, we sample 1000 iterations for each of the 4 chains for easy comparison with the corresponding Stan results.
samp_jags_ncp <- coda.samples(
jags_ncp, c("theta", "mu", "tau"),
n.iter = 1000
)
samp_jags_ncp <- aperm(abind(samp_jags_ncp, along = 3), c(1, 3, 2))
Convergence diagnostics indicate much better mixing than for the centered eight schools model.
mon <- monitor(samp_jags_ncp)
print(mon)
Inference for the input samples (4 chains: each with iter = 1000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
mu -1.25 4.33 9.66 4.25 3.34 1 2881 3054
tau 0.25 2.73 9.55 3.51 3.02 1 1013 1185
theta[1] -1.70 5.46 15.38 5.99 5.38 1 2845 2373
theta[2] -2.58 4.74 12.29 4.77 4.70 1 3915 3457
theta[3] -4.92 3.97 11.32 3.74 5.14 1 3367 2748
theta[4] -2.94 4.56 12.22 4.56 4.72 1 3691 3797
theta[5] -4.55 3.83 10.68 3.52 4.64 1 3673 3050
theta[6] -4.37 4.21 11.40 3.96 4.90 1 3644 3241
theta[7] -0.93 5.70 15.04 6.17 4.98 1 3082 3148
theta[8] -3.73 4.53 12.86 4.58 5.20 1 3467 3136
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
Specifically, the mixing of tau looks much better, although we still see some problems in the estimation of larger quantiles.
plot_local_ess(samp_jags_ncp, par = "tau", nalpha = 20)
plot_quantile_ess(samp_jags_ncp, par = "tau", nalpha = 20)
The change in effective sample size is roughly linear, indicating that the remaining convergence problems are likely to be solved by running longer chains.
plot_change_ess(samp_jags_ncp, par = "tau")
Result: Similar to Stan, Jags can sample from the non-centered parameterization of the eight schools model much better than from the centered parameterization.
We will illustrate rank normalization with a few examples. First, we plot histograms and empirical cumulative distribution functions (ECDFs) with respect to the original parameter values (\(\theta\)), scaled ranks (ranks divided by the maximum rank), and rank-normalized values (\(z\)). We use scaled ranks to make the plots look similar for different numbers of draws.
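As a concrete sketch of the transformation (we assume the Blom-style fractional offset \(c = 3/8\) used in the paper):
# Rank normalization: ranks -> fractional positions -> normal quantiles.
theta <- rcauchy(100)
S <- length(theta)
r <- rank(theta)                       # average ranks in case of ties
z <- qnorm((r - 3/8) / (S - 3/4 + 1))  # rank-normalized values
scaled_r <- r / max(r)                 # scaled ranks used in the plots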
100 draws from Normal(0, 1):
n <- 100
theta <- rnorm(n)
plot_ranknorm(theta, n)
100 draws from Exponential(1):
theta <- rexp(n)
plot_ranknorm(theta, n)
100 draws from Cauchy(0, 1):
theta <- rcauchy(n)
plot_ranknorm(theta, n)
In the above plots, the ECDFs with respect to scaled ranks and rank-normalized \(z\)-values look exactly the same for all distributions. In split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare the ranks and ECDFs of the individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from Normal(0, 1):
n <- 100
m <- 8
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m)
The variation in the ECDF due to the variation in ranks is now visible also in the scaled ranks and rank-normalized \(z\)-values from different batches (chains).
The benefit of rank normalization is more obvious for non-normal distributions such as the Cauchy:
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m)
Rank normalization makes the subsequent computations well defined and invariant under bijective transformations. This means that we get the same results, for example, if we use unconstrained or constrained parameterisations in a model.
In the paper, we defined the empirical CDF (ECDF) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq \theta_\alpha). \]
For independent draws, the uncertainty in \(p(\theta \leq \theta_\alpha)\) given the observed count \(S\bar{I}_\alpha\) can be described by a \({\rm Beta}(S\bar{I}_\alpha + 1,\, S - S\bar{I}_\alpha + 1)\) distribution. Thus we can easily examine the variation of the ECDF for any \(\theta_\alpha\) value from a single chain. If \(S\bar{I}_\alpha\) is not very close to \(0\) or \(S\), and \(S\) is large, we can use the variance of the Beta distribution
\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(S\bar{I}_\alpha+1)(S-S\bar{I}_\alpha+1)}{(S+2)^2(S+3)} \] together with a normal approximation. We illustrate uncertainty intervals based on the Beta distribution and the normal approximation of the ECDF for 100 draws from Normal(0, 1):
n <- 100
m <- 1
theta <- rnorm(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
Uncertainty intervals of the ECDF for draws from Cauchy(0, 1) again illustrate the improved visual clarity of plotting with scaled ranks:
n <- 100
m <- 1
theta <- rcauchy(n * m)
plot_ranknorm(theta, n, m, interval = TRUE)
The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.
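To make the interval construction concrete, here is a minimal sketch for a single evaluation point (\(\theta_\alpha = 0\) is our choice):
# 95% uncertainty interval for the ECDF at theta_alpha = 0 from S = 100
# independent draws, via the Beta distribution and its normal approximation.
S <- 100
theta <- rnorm(S)
count <- sum(theta <= 0)                         # S * I_bar
qbeta(c(0.025, 0.975), count + 1, S - count + 1) # Beta interval
p_hat <- (count + 1) / (S + 2)                   # Beta mean
v <- (count + 1) * (S - count + 1) / ((S + 2)^2 * (S + 3))
p_hat + c(-1.96, 1.96) * sqrt(v)                 # normal approximation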
We have already seen that the effective sample size of dynamic HMC can be higher than with independent draws. The next example illustrates interesting relative-efficiency phenomena arising from the properties of dynamic HMC algorithms.
We sample from a simple 16-dimensional standard normal model.
writeLines(readLines("normal.stan"))
data {
int<lower=1> J;
}
parameters {
vector[J] x;
}
model {
x ~ normal(0, 1);
}
fit_n <- stan(
file = 'normal.stan', data = data.frame(J = 16),
iter = 20000, chains = 4, seed = 483892929, refresh = 0
)
samp <- as.array(fit_n)
monitor(samp)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] -1.62 0.00 1.63 0.00 0.99 1 96981 28199
x[2] -1.64 0.01 1.66 0.00 1.00 1 104409 29458
x[3] -1.66 0.00 1.64 0.00 1.00 1 93302 28779
x[4] -1.65 0.00 1.66 0.00 1.01 1 101622 28272
x[5] -1.64 0.00 1.64 0.00 1.00 1 102670 27109
x[6] -1.65 0.01 1.64 0.00 1.00 1 93734 28755
x[7] -1.65 -0.01 1.63 -0.01 0.99 1 97602 29524
x[8] -1.63 0.00 1.64 0.00 1.00 1 99896 29687
x[9] -1.64 0.00 1.65 0.00 1.00 1 108129 29337
x[10] -1.63 0.00 1.63 0.00 0.99 1 99153 29951
x[11] -1.65 0.00 1.66 -0.01 1.00 1 101781 29480
x[12] -1.64 0.00 1.65 0.00 1.00 1 99202 30573
x[13] -1.65 0.01 1.66 0.01 1.00 1 102374 28030
x[14] -1.64 0.00 1.65 0.00 1.00 1 97512 30198
x[15] -1.66 0.00 1.66 0.00 1.01 1 100121 28814
x[16] -1.65 0.00 1.65 0.00 1.00 1 97788 29000
lp__ -13.17 -7.67 -3.95 -7.99 2.85 1 14195 20136
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat
x[1] 0.00 0.00 0.99 -1.62 0.00 1.63 96791 2.42 97003 96771 96981 2.42 1 1
x[2] 0.00 0.00 1.00 -1.64 0.01 1.66 104176 2.60 104353 104233 104409 2.61 1 1
x[3] 0.00 0.00 1.00 -1.66 0.00 1.64 92850 2.32 93370 92782 93302 2.33 1 1
x[4] 0.00 0.00 1.01 -1.65 0.00 1.66 101100 2.53 101428 101293 101622 2.54 1 1
x[5] 0.00 0.00 1.00 -1.64 0.00 1.64 101814 2.55 102625 101855 102670 2.57 1 1
x[6] 0.00 0.00 1.00 -1.65 0.01 1.64 92945 2.32 93713 92968 93734 2.34 1 1
x[7] -0.01 0.00 0.99 -1.65 -0.01 1.63 97394 2.43 97642 97353 97602 2.44 1 1
x[8] 0.00 0.00 1.00 -1.63 0.00 1.64 99741 2.49 99949 99688 99896 2.50 1 1
x[9] 0.00 0.00 1.00 -1.64 0.00 1.65 107571 2.69 108141 107561 108129 2.70 1 1
x[10] 0.00 0.00 0.99 -1.63 0.00 1.63 98812 2.47 99128 98838 99153 2.48 1 1
x[11] -0.01 0.00 1.00 -1.65 0.00 1.66 101379 2.53 101861 101299 101781 2.54 1 1
x[12] 0.00 0.00 1.00 -1.64 0.00 1.65 98780 2.47 99163 98819 99202 2.48 1 1
x[13] 0.01 0.00 1.00 -1.65 0.01 1.66 102158 2.55 102428 102104 102374 2.56 1 1
x[14] 0.00 0.00 1.00 -1.64 0.00 1.65 97052 2.43 97341 97223 97512 2.44 1 1
x[15] 0.00 0.00 1.01 -1.66 0.00 1.66 100055 2.50 100178 100000 100121 2.50 1 1
x[16] 0.00 0.00 1.00 -1.65 0.00 1.65 97495 2.44 97773 97510 97788 2.44 1 1
lp__ -7.99 0.02 2.85 -13.17 -7.67 -3.95 14445 0.36 14453 14188 14195 0.35 1 1
zRhat zsRhat zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1] 1 1 1 16248 0.41 28199 0.70 77884 1.95 19350 0.48
x[2] 1 1 1 15968 0.40 29458 0.74 80243 2.01 19615 0.49
x[3] 1 1 1 16243 0.41 28779 0.72 76082 1.90 18875 0.47
x[4] 1 1 1 16328 0.41 28272 0.71 81090 2.03 19047 0.48
x[5] 1 1 1 16344 0.41 27109 0.68 80858 2.02 19267 0.48
x[6] 1 1 1 16759 0.42 28755 0.72 80715 2.02 19105 0.48
x[7] 1 1 1 16179 0.40 29524 0.74 79227 1.98 19179 0.48
x[8] 1 1 1 16874 0.42 29687 0.74 81168 2.03 19297 0.48
x[9] 1 1 1 16680 0.42 29337 0.73 77215 1.93 19715 0.49
x[10] 1 1 1 16923 0.42 29951 0.75 77581 1.94 19481 0.49
x[11] 1 1 1 16374 0.41 29480 0.74 79409 1.99 18625 0.47
x[12] 1 1 1 16865 0.42 30573 0.76 78582 1.96 19363 0.48
x[13] 1 1 1 15946 0.40 28030 0.70 77579 1.94 19711 0.49
x[14] 1 1 1 17369 0.43 30198 0.75 79292 1.98 20006 0.50
x[15] 1 1 1 16772 0.42 28814 0.72 80536 2.01 19647 0.49
x[16] 1 1 1 16817 0.42 29000 0.72 78882 1.97 19653 0.49
lp__ 1 1 1 21109 0.53 20136 0.50 16373 0.41 23827 0.60
The Bulk-ESS for all \(x\) is at least 93302. However, the Tail-ESS for all \(x\) is at most 30573. Further, the Bulk-ESS for lp__ is only 14195.
If we take a look at all the Stan examples in this notebook, we see that the bulk relative efficiency for lp__ is always below 0.5. This is because lp__ correlates strongly with the total energy in HMC, which is sampled using a random-walk proposal once per iteration. Thus, it's likely that lp__ shows some random-walk behavior as well, leading to autocorrelation and a small relative efficiency. At the same time, dynamic HMC can create antithetic Markov chains, which have negative autocorrelations at odd lags. This results in a bulk-ESS greater than the total number of draws \(S\) for some parameters.
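A quick check of these autocorrelations for the normal example (a sketch using chain 1 only):
# Lag-1 autocorrelation: positive for lp__ (random-walk-like energy),
# negative for x[1] (antithetic behavior of dynamic HMC).
acf(samp[, 1, "lp__"], plot = FALSE)$acf[2]
acf(samp[, 1, "x[1]"], plot = FALSE)$acf[2]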
Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval estimates for x[1].
plot_local_ess(fit_n, par = 1, nalpha = 100)
The relative efficiency of the small-interval probability estimates is close to 1, with a slight drop in the tails. This is a good result, but far from the relative efficiencies of the bulk, mean, and median estimates. Let's check the effective sample size for quantiles.
plot_quantile_ess(fit = fit_n, par = 1, nalpha = 100)
Central quantile estimates have higher effective sample size than tail quantile estimates.
The total energy of HMC constrains how far into the tails a chain can go within one iteration. The tails of the target have high potential energy, and thus only iterations with high total energy can reach them. This suggests that the random walk in the total energy causes random-walk behavior in the variance of \(x\). Let's check the second moment of \(x\).
samp_x2 <- as.array(fit_n, pars = "x")^2
monitor(samp_x2)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
x[1] 0 0.45 3.81 0.99 1.41 1 16251 18659
x[2] 0 0.44 3.92 1.00 1.42 1 15983 18321
x[3] 0 0.45 3.89 1.00 1.42 1 16236 18502
x[4] 0 0.46 3.95 1.02 1.46 1 16334 18983
x[5] 0 0.44 3.80 0.99 1.41 1 16334 17507
x[6] 0 0.46 3.82 1.00 1.39 1 16741 19952
x[7] 0 0.44 3.82 0.99 1.39 1 16160 19410
x[8] 0 0.46 3.80 1.00 1.40 1 16879 19856
x[9] 0 0.45 3.84 1.00 1.42 1 16685 18873
x[10] 0 0.44 3.80 0.98 1.40 1 16898 18605
x[11] 0 0.47 3.86 1.01 1.40 1 16353 18630
x[12] 0 0.46 3.88 1.00 1.44 1 16870 18740
x[13] 0 0.46 3.84 1.01 1.41 1 16048 18402
x[14] 0 0.45 3.82 0.99 1.40 1 17376 19872
x[15] 0 0.45 3.90 1.01 1.44 1 16784 18286
x[16] 0 0.45 3.87 1.01 1.43 1 16795 19505
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (good values is
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
res <- monitor_extra(samp_x2)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):
mean se_mean sd Q5 Q50 Q95 seff reff sseff zseff zsseff zsreff Rhat sRhat zRhat zsRhat
x[1] 0.99 0.01 1.41 0 0.45 3.81 15003 0.38 15032 16230 16251 0.41 1 1 1 1
x[2] 1.00 0.01 1.42 0 0.44 3.92 14175 0.35 14211 15949 15983 0.40 1 1 1 1
x[3] 1.00 0.01 1.42 0 0.45 3.89 14953 0.37 14953 16228 16236 0.41 1 1 1 1
x[4] 1.02 0.01 1.46 0 0.46 3.95 14416 0.36 14426 16327 16334 0.41 1 1 1 1
x[5] 0.99 0.01 1.41 0 0.44 3.80 14113 0.35 14122 16320 16334 0.41 1 1 1 1
x[6] 1.00 0.01 1.39 0 0.46 3.82 15467 0.39 15478 16725 16741 0.42 1 1 1 1
x[7] 0.99 0.01 1.39 0 0.44 3.82 15273 0.38 15291 16146 16160 0.40 1 1 1 1
x[8] 1.00 0.01 1.40 0 0.46 3.80 15859 0.40 15854 16891 16879 0.42 1 1 1 1
x[9] 1.00 0.01 1.42 0 0.45 3.84 14856 0.37 14874 16669 16685 0.42 1 1 1 1
x[10] 0.98 0.01 1.40 0 0.44 3.80 15182 0.38 15183 16894 16898 0.42 1 1 1 1
x[11] 1.01 0.01 1.40 0 0.47 3.86 14766 0.37 14764 16368 16353 0.41 1 1 1 1
x[12] 1.00 0.01 1.44 0 0.46 3.88 15039 0.38 15047 16857 16870 0.42 1 1 1 1
x[13] 1.01 0.01 1.41 0 0.46 3.84 14281 0.36 14284 16016 16048 0.40 1 1 1 1
x[14] 0.99 0.01 1.40 0 0.45 3.82 15836 0.40 15839 17373 17376 0.43 1 1 1 1
x[15] 1.01 0.01 1.44 0 0.45 3.90 15041 0.38 15042 16783 16784 0.42 1 1 1 1
x[16] 1.01 0.01 1.43 0 0.45 3.87 15063 0.38 15069 16778 16795 0.42 1 1 1 1
zfsRhat zfsseff zfsreff tailseff tailreff medsseff medsreff madsseff madsreff
x[1] 1 19185 0.48 18659 0.47 19329 0.48 23759 0.59
x[2] 1 18862 0.47 18321 0.46 19648 0.49 23880 0.60
x[3] 1 19526 0.49 18502 0.46 18845 0.47 24924 0.62
x[4] 1 17976 0.45 18983 0.47 19046 0.48 23027 0.58
x[5] 1 18613 0.47 17507 0.44 19219 0.48 23699 0.59
x[6] 1 19530 0.49 19952 0.50 19137 0.48 24473 0.61
x[7] 1 19258 0.48 19410 0.49 19190 0.48 24176 0.60
x[8] 1 19610 0.49 19856 0.50 19292 0.48 24890 0.62
x[9] 1 19020 0.48 18873 0.47 19667 0.49 23809 0.60
x[10] 1 19825 0.50 18605 0.47 19514 0.49 24381 0.61
x[11] 1 18992 0.47 18630 0.47 18618 0.47 24285 0.61
x[12] 1 19392 0.48 18740 0.47 19383 0.48 25018 0.63
x[13] 1 19233 0.48 18402 0.46 19707 0.49 24289 0.61
x[14] 1 19916 0.50 19872 0.50 20016 0.50 24604 0.62
x[15] 1 19255 0.48 18286 0.46 19637 0.49 24952 0.62
x[16] 1 19579 0.49 19505 0.49 19718 0.49 24354 0.61
The mean of the bulk-ESS for \(x_j^2\) is about 16545, which is quite close to the bulk-ESS for lp__. This is not that surprising, as the potential energy in the normal model is proportional to \(\sum_{j=1}^J x_j^2\).
Let’s check the effective sample size in different parts of the posterior by computing the effective sample size for small interval probability estimates for x[1]^2.
plot_local_ess(fit = samp_x2, par = 1, nalpha = 100)
The relative efficiency is mostly a bit below 1, but for the right tail of \(x_1^2\) it drops. This is likely because only some iterations have high enough total energy to obtain draws from the high-energy part of the tail. Let's check the effective sample size for quantiles.
plot_quantile_ess(fit = samp_x2, par = 1, nalpha = 100)
We can see the correlation between lp__ and the magnitude of x[1] in the following plot.
samp <- as.array(fit_n)
qplot(
as.vector(samp[, , "lp__"]),
abs(as.vector(samp[, , "x[1]"]))
) +
labs(x = 'lp__', y = 'x[1]')
Low lp__ values correspond to high energy and more variation in x[1], and high lp__ values correspond to low energy and less variation in x[1]. Finally, \(\sum_{j=1}^J x_j^2\) is perfectly (negatively) correlated with lp__, since for this model lp__ equals \(-\frac{1}{2}\sum_{j=1}^J x_j^2\) plus a constant.
qplot(
as.vector(samp[, , "lp__"]),
as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) +
labs(x = 'lp__', y = 'sum(x^2)')
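As a numeric check, the correlation should be \(-1\) up to floating point error, given the affine relationship above:
# Correlation between lp__ and the sum of squares of x:
cor(as.vector(samp[, , "lp__"]),
    as.vector(apply(samp[, , 1:16]^2, 1:2, sum)))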
This shows that even if we get high effective sample size estimates for central quantities (like the mean or median), it is important to also look at the relative efficiency of scale and tail quantities. The effective sample size of lp__ can also indicate problems with sampling in the tails.
makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
writeLines(readLines(makevars))
}
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXXFLAGS+=-flto -ffat-lto-objects -Wno-unused-local-typedefs
CXXFLAGS+=-std=c++11
CFLAGS+=-O3
devtools::session_info("rstan")
─ Session info ───────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.1 (2018-07-02)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui X11
language en_GB:en
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Helsinki
date 2019-03-14
─ Packages ───────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1)
backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.1)
BH 1.69.0-1 2019-01-07 [1] CRAN (R 3.5.1)
callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.1)
checkmate 1.9.1 2019-01-15 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
colorspace 1.4-0 2019-01-13 [1] CRAN (R 3.5.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1)
ggplot2 * 3.1.0 2018-10-25 [1] CRAN (R 3.5.1)
glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.5.1)
gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.1)
inline 0.3.15 2018-05-18 [1] CRAN (R 3.5.1)
labeling 0.3 2014-08-23 [1] CRAN (R 3.5.1)
lattice 0.20-38 2018-11-04 [1] CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.1)
loo 2.0.0.9000 2019-03-11 [1] Github (stan-dev/loo@f920d8c)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
MASS 7.3-51.1 2018-11-01 [1] CRAN (R 3.5.1)
Matrix 1.2-16 2019-03-08 [1] CRAN (R 3.5.1)
matrixStats 0.54.0 2018-07-23 [1] CRAN (R 3.5.1)
mgcv 1.8-27 2019-02-06 [1] CRAN (R 3.5.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.1)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.1)
pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
processx 3.3.0 2019-03-10 [1] CRAN (R 3.5.1)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.1)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.1)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.1)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.1)
RcppEigen 0.3.3.5.0 2018-11-24 [1] CRAN (R 3.5.1)
reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.5.1)
rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.1)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
rstan * 2.18.2 2018-11-07 [1] CRAN (R 3.5.1)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
StanHeaders * 2.18.1 2019-01-28 [1] CRAN (R 3.5.1)
stringi 1.3.1 2019-02-13 [1] CRAN (R 3.5.1)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.5.1)
tibble * 2.0.1 2019-01-12 [1] CRAN (R 3.5.1)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.1)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
[1] /u/77/ave/unix/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library